Principle:Haifengl Smile File Data Loading
Overview
File Data Loading is the foundational principle of loading tabular data from multiple file formats into a unified in-memory DataFrame structure. In the Smile Java machine learning library, data ingestion is designed to be format-agnostic: a single entry point can detect and parse CSV, JSON, Parquet, Apache Arrow (Feather), Apache Avro, ARFF (Weka), SAS7BDAT, and plain text files. The caller works with a consistent DataFrame regardless of the underlying serialization format.
This principle decouples the "what" (a rectangular, column-oriented table of typed values) from the "how" (the byte-level encoding on disk), so data scientists and engineers can swap file formats without rewriting downstream analysis code.
Theoretical Basis
The principle is grounded in the ETL (Extract-Transform-Load) paradigm widely used in data engineering:
- Extract -- Read raw bytes from a storage system (local file, URI, classpath resource).
- Transform -- Parse the format-specific encoding (delimiters, schemas, compression) into typed column vectors.
- Load -- Materialize the result as an in-memory DataFrame ready for analysis.
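The three phases above can be sketched in plain Java. This is a toy illustration, not Smile's implementation: the class `EtlSketch`, its method names, the hard-coded sample text, and the assumption that every column is numeric are all hypothetical, and a `Map<String, double[]>` stands in for a DataFrame.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy Extract-Transform-Load sketch (hypothetical; not Smile's code). */
class EtlSketch {
    /** Extract: obtain the raw content. A real reader would pull bytes from a file or URI. */
    static String extract() {
        return "id,score\n1,0.5\n2,0.75\n";
    }

    /** Transform: parse comma-delimited text into typed column vectors (all numeric here). */
    static Map<String, double[]> transform(String raw) {
        String[] lines = raw.trim().split("\n");
        String[] header = lines[0].split(",");
        int rows = lines.length - 1;
        Map<String, double[]> columns = new LinkedHashMap<>();
        for (String name : header) {
            columns.put(name, new double[rows]);
        }
        for (int i = 0; i < rows; i++) {
            String[] cells = lines[i + 1].split(",");
            for (int j = 0; j < header.length; j++) {
                columns.get(header[j])[i] = Double.parseDouble(cells[j]);
            }
        }
        return columns;
    }

    /** Load: expose the columns as a read-only in-memory table. */
    static Map<String, double[]> load(Map<String, double[]> columns) {
        return Collections.unmodifiableMap(columns);
    }

    public static void main(String[] args) {
        Map<String, double[]> table = load(transform(extract()));
        System.out.println(table.keySet() + " rows=" + table.get("score").length);
    }
}
```

Keeping the phases as separate methods mirrors the point of the list: only `extract` and `transform` need to know about the on-disk encoding.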
By abstracting the Extract phase behind a unified interface, Smile implements the Adapter pattern from software engineering: each file format has its own reader class (e.g., CSV, JSON, Parquet, Arrow, Arff, SAS, Avro) that implements the same contract -- reading a path and returning a DataFrame.
Formally, let $F$ be the set of supported file formats, and let $r_f$ be the reader for format $f \in F$. The unified reader is:

$$R(p) = r_{\phi(p)}(p)$$

where $\phi$ is the format detection function that maps a file path $p$ to its format based on the file extension (e.g., .csv, .json, .parquet, .feather, .arff, .sas7bdat, .avro).
This ensures that the output type is invariant across formats:

$$\forall f \in F: \quad r_f : \mathrm{Path} \to \mathrm{DataFrame}$$
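The dispatch described above (detect the format from the extension, then delegate to that format's reader) might look like the following minimal sketch. Everything here is hypothetical: `UnifiedReaderSketch`, `FormatReader`, and `Table` are illustrative stand-ins, the placeholder readers do not parse anything, and the extension-to-format mapping is an assumption rather than Smile's actual table.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Extension-based dispatch to per-format readers (hypothetical sketch). */
class UnifiedReaderSketch {
    /** Stand-in for the common output type that every reader returns. */
    static final class Table {
        final String path;
        final String format;
        Table(String path, String format) {
            this.path = path;
            this.format = format;
        }
    }

    /** The shared contract: read a path, return a Table. */
    interface FormatReader {
        Table read(String path);
    }

    private static final Map<String, String> EXTENSION_TO_FORMAT = new HashMap<>();
    private static final Map<String, FormatReader> READERS = new HashMap<>();

    static {
        EXTENSION_TO_FORMAT.put("csv", "csv");
        EXTENSION_TO_FORMAT.put("json", "json");
        EXTENSION_TO_FORMAT.put("parquet", "parquet");
        EXTENSION_TO_FORMAT.put("feather", "arrow");
        EXTENSION_TO_FORMAT.put("arff", "arff");
        EXTENSION_TO_FORMAT.put("sas7bdat", "sas");
        EXTENSION_TO_FORMAT.put("avro", "avro");
        // Placeholder readers: each only records which format was chosen.
        for (String format : EXTENSION_TO_FORMAT.values()) {
            READERS.put(format, path -> new Table(path, format));
        }
    }

    /** Format detection: map the (case-insensitive) file extension to a format name. */
    static String detect(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        String format = EXTENSION_TO_FORMAT.get(ext);
        if (format == null) {
            throw new IllegalArgumentException("Unrecognized extension: '" + ext + "'");
        }
        return format;
    }

    /** Unified reader: pick the reader for the detected format and delegate to it. */
    static Table read(String path) {
        return READERS.get(detect(path)).read(path);
    }

    public static void main(String[] args) {
        System.out.println(read("data/iris.csv").format);
    }
}
```

Because every entry in the registry satisfies the same `FormatReader` contract, callers of `read` never see which concrete reader ran.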
Supported Formats
| Format | Extension(s) | Schema Source | Notes |
|---|---|---|---|
| CSV / TSV | .csv, .tsv, .dat, .txt | Inferred from data or user-supplied StructType | Uses Apache Commons CSV; configurable delimiter, quote, escape, comment, header |
| JSON | .json | Inferred or user-supplied | Single-line (JSON Lines) or multi-line mode |
| Apache Parquet | .parquet | Embedded in file | Columnar storage; efficient for large datasets |
| Apache Arrow / Feather | .feather | Embedded in file | Cross-language columnar in-memory format |
| Apache Avro | .avro | External schema file (JSON) | Row-oriented; requires separate schema |
| ARFF (Weka) | .arff | Embedded in file header | Common in ML research; includes attribute metadata |
| SAS | .sas7bdat | Embedded in file | SAS statistical software native format |
Design Principles
Convention over Configuration
The default Read.data(path) method uses file extension detection to select the appropriate reader. No explicit format specification is required for standard file extensions. When the extension is ambiguous or absent, an optional format parameter overrides the detection.
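The convention-plus-override behavior can be illustrated with a pair of overloads. This is a sketch under stated assumptions, not Smile's code: `FormatResolutionSketch` and `resolve` are hypothetical names, and the rule that the extension itself serves as the format name is a simplification.

```java
import java.util.Locale;

/** Extension detection with an optional explicit override (hypothetical sketch). */
class FormatResolutionSketch {
    /** Convention: no format given, so rely on the file extension. */
    static String resolve(String path) {
        return resolve(path, null);
    }

    /** Configuration: an explicit format, when present, wins over the extension. */
    static String resolve(String path, String explicitFormat) {
        if (explicitFormat != null && !explicitFormat.isEmpty()) {
            return explicitFormat.toLowerCase(Locale.ROOT);
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            throw new IllegalArgumentException(
                "No file extension; pass an explicit format for: " + path);
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(resolve("sales.csv"));
        System.out.println(resolve("export.dat", "csv"));
    }
}
```

The one-argument overload covers the common case, while the two-argument form handles files such as `export.dat` whose extension does not identify the encoding.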
Schema Flexibility
For formats like CSV and JSON where schema is not embedded, Smile supports three modes:
- Automatic inference -- Column types are deduced from the data values.
- Explicit schema -- A StructType object defines column names, types, and measures.
- Format string -- A compact string like "delimiter=\t,header=true,comment=#" configures parsing behavior.
Uniform Output
All readers produce a DataFrame -- a two-dimensional, potentially heterogeneous tabular structure backed by typed ValueVector columns. This guarantees that downstream operations (inspection, selection, transformation, numerical conversion) work identically regardless of the source format.
Relationship to the Data Loading Pipeline
File Data Loading is the first stage of the Smile Data Loading Pipeline. The subsequent stages are:
- DataFrame Inspection -- Examine schema, dimensions, and metadata.
- Column Selection and Filtering -- Project and filter columns.
- Data Transformation -- Normalize, standardize, and scale features.
- Numerical Conversion -- Convert to arrays/matrices for ML algorithms.
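As a rough picture of what the later stages do to a loaded table, the sketch below applies selection, standardization, and array conversion to a toy column map. None of this is Smile's API: `PipelineSketch` and its methods are hypothetical, a `Map<String, double[]>` stands in for a DataFrame, standardization uses the sample standard deviation (n - 1) as one possible choice, and the code assumes a non-empty table with at least two rows and no constant columns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy versions of the post-loading stages (hypothetical sketch). */
class PipelineSketch {
    /** Column selection: keep only the named columns, in the order given. */
    static Map<String, double[]> select(Map<String, double[]> table, String... names) {
        Map<String, double[]> out = new LinkedHashMap<>();
        for (String name : names) {
            double[] column = table.get(name);
            if (column == null) {
                throw new IllegalArgumentException("No such column: " + name);
            }
            out.put(name, column);
        }
        return out;
    }

    /** Data transformation: z-score each value using the sample standard deviation. */
    static double[] standardize(double[] x) {
        double mean = 0;
        for (double v : x) {
            mean += v;
        }
        mean /= x.length;
        double sumSquares = 0;
        for (double v : x) {
            sumSquares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSquares / (x.length - 1));
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - mean) / sd;
        }
        return z;
    }

    /** Numerical conversion: turn the column map into a row-major matrix. */
    static double[][] toArray(Map<String, double[]> table) {
        int rows = table.values().iterator().next().length;
        double[][] matrix = new double[rows][table.size()];
        int j = 0;
        for (double[] column : table.values()) {
            for (int i = 0; i < rows; i++) {
                matrix[i][j] = column[i];
            }
            j++;
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, double[]> table = new LinkedHashMap<>();
        table.put("a", new double[] {1, 2, 3});
        table.put("b", new double[] {10, 20, 30});
        // Inspection: column names and row count.
        System.out.println(table.keySet() + " rows=" + table.get("a").length);
        double[][] matrix = toArray(select(table, "b", "a"));
        System.out.println(matrix.length + " x " + matrix[0].length);
    }
}
```

Each stage takes and returns the same table shape, which is why the loading stage's uniform output matters for everything downstream.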
Metadata
| Property | Value |
|---|---|
| Domains | Data_Engineering, ETL |
| Workflow | Data_Loading_Pipeline |
| Stage | 1 of 5 |
| Last Updated | 2026-02-08 22:00 GMT |