Principle:Haifengl Smile File Data Loading

Overview

File Data Loading is the foundational principle of loading tabular data from multiple file formats into a unified in-memory DataFrame structure. In the Smile Java machine learning library, data ingestion is designed to be format-agnostic: a single entry point can detect and parse CSV, JSON, Parquet, Apache Arrow (Feather), Apache Avro, ARFF (Weka), SAS7BDAT, and plain text files. The caller works with a consistent DataFrame regardless of the underlying serialization format.

This principle decouples the "what" (a rectangular, column-oriented table of typed values) from the "how" (the byte-level encoding on disk), so data scientists and engineers can swap file formats without rewriting downstream analysis code.

Theoretical Basis

The principle is grounded in the ETL (Extract-Transform-Load) paradigm widely used in data engineering:

Extract -- Read raw bytes from a storage system (local file, URI, classpath resource).
Transform -- Parse the format-specific encoding (delimiters, schemas, compression) into typed column vectors.
Load -- Materialize the result as an in-memory DataFrame ready for analysis.
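
The three phases can be sketched in plain Java without any library. The class `EtlSketch` and its methods are hypothetical names chosen for illustration, not part of Smile, and the sketch assumes a comma-separated file with a header row and purely numeric columns.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of Extract -> Transform -> Load; not Smile code.
class EtlSketch {
    /** Extract: read the raw bytes from storage. */
    static byte[] extract(Path path) {
        try {
            return Files.readAllBytes(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Transform and Load: parse the bytes and materialize typed columns. */
    static Map<String, double[]> toColumns(byte[] raw) {
        // Transform: decode the text and split it into a header and data rows.
        String[] lines = new String(raw, StandardCharsets.UTF_8).trim().split("\\R");
        String[] names = lines[0].split(",");
        int nrow = lines.length - 1;

        // Load: allocate one typed array per column, in header order.
        Map<String, double[]> columns = new LinkedHashMap<>();
        for (String name : names) {
            columns.put(name.trim(), new double[nrow]);
        }

        // Transform (continued): parse each cell into its column's array.
        for (int i = 0; i < nrow; i++) {
            String[] cells = lines[i + 1].split(",");
            int j = 0;
            for (double[] column : columns.values()) {
                column[i] = Double.parseDouble(cells[j++].trim());
            }
        }
        return columns;
    }

    /** The full pipeline for one file. */
    static Map<String, double[]> load(Path path) {
        return toColumns(extract(path));
    }
}
```

Keeping `extract` separate from `toColumns` mirrors the split between storage access and format parsing: only the second method would change if the encoding on disk changed.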

By hiding the Extract and Transform phases behind a unified interface, Smile follows an adapter-style design: each file format has its own reader class (e.g., CSV, JSON, Parquet, Arrow, Arff, SAS, Avro) that honors the same contract, namely reading a path and returning a DataFrame.

Formally, let $F = \{f_{1}, f_{2}, \dots, f_{k}\}$ be the set of supported file formats, and let $R_{i} : \mathrm{Path} \to \mathrm{DataFrame}$ be the reader for format $f_{i}$. The unified reader is:

$R(p) = R_{\phi(p)}(p)$

where $\phi(p)$ is the format detection function that maps a file path $p$ to the index of its format, based on the file extension (e.g., .csv, .json, .parquet, .feather, .arff, .sas7bdat, .avro).

This ensures that the output type is invariant across formats:

$\forall i, j \in \{1, \dots, k\} : \operatorname{type}(R_{i}(p)) = \operatorname{type}(R_{j}(q)) = \mathrm{DataFrame}$
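
The detection function and the unified reader above can be sketched as a lookup table from extension to reader function. Everything here is hypothetical: `FormatDispatch`, `detect`, and `read` are illustrative names, only three formats are registered, and each placeholder reader returns a String where a real reader would return a DataFrame, so the block runs without Smile on the classpath.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of extension-based dispatch; not Smile's implementation.
class FormatDispatch {
    // One reader per format, all sharing the signature path -> result.
    // Each placeholder returns a String where a real reader would return a DataFrame.
    static final Map<String, Function<String, String>> READERS = Map.of(
        "csv", path -> "csv table from " + path,
        "json", path -> "json table from " + path,
        "parquet", path -> "parquet table from " + path);

    /** The detection function: maps a path to a format key via its extension. */
    static String detect(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + path);
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** The unified reader: looks up the reader chosen by detect and applies it. */
    static String read(String path) {
        Function<String, String> reader = READERS.get(detect(path));
        if (reader == null) {
            throw new IllegalArgumentException("Unsupported format: " + path);
        }
        return reader.apply(path);
    }
}
```

Because every entry in `READERS` has the same function type, the return type of `read` does not depend on which reader was selected, which is the invariance property stated above.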

Supported Formats

Format	Extension(s)	Schema Source	Notes
CSV / TSV	`.csv`, `.tsv`, `.dat`, `.txt`	Inferred from data or user-supplied `StructType`	Uses Apache Commons CSV; configurable delimiter, quote, escape, comment, header
JSON	`.json`	Inferred or user-supplied	Single-line (JSON Lines) or multi-line mode
Apache Parquet	`.parquet`	Embedded in file	Columnar storage; efficient for large datasets
Apache Arrow / Feather	`.feather`	Embedded in file	Cross-language columnar in-memory format
Apache Avro	`.avro`	External schema file (JSON)	Row-oriented; requires separate schema
ARFF (Weka)	`.arff`	Embedded in file header	Common in ML research; includes attribute metadata
SAS	`.sas7bdat`	Embedded in file	SAS statistical software native format

Design Principles

Convention over Configuration

The default Read.data(path) method uses file extension detection to select the appropriate reader. No explicit format specification is required for standard file extensions. When the extension is ambiguous or absent, an optional format parameter overrides the detection.
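
The rule "explicit format wins, otherwise fall back to the extension" can be sketched as follows. `FormatResolver` and `resolveFormat` are hypothetical names, and the format keys on the right-hand side of the map are illustrative labels rather than values Smile is known to accept; the extensions themselves come from the Supported Formats table on this page.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of "convention over configuration"; not Smile's code.
class FormatResolver {
    // Extension conventions, taken from the Supported Formats table on this page.
    // The format labels on the right are illustrative only.
    static final Map<String, String> BY_EXTENSION = Map.of(
        "csv", "csv", "tsv", "csv", "dat", "csv", "txt", "csv",
        "json", "json", "parquet", "parquet", "feather", "arrow",
        "avro", "avro", "arff", "arff", "sas7bdat", "sas");

    /** Returns the explicit format if given, else the one implied by the extension. */
    static String resolveFormat(String path, String explicitFormat) {
        if (explicitFormat != null && !explicitFormat.trim().isEmpty()) {
            return explicitFormat.trim().toLowerCase(Locale.ROOT); // configuration wins
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        String format = BY_EXTENSION.get(ext);
        if (format == null) {
            throw new IllegalArgumentException(
                "Cannot infer format of " + path + "; pass one explicitly");
        }
        return format;
    }
}
```

Failing loudly on an unknown extension, instead of guessing a default, keeps the convention predictable: the caller either gets the documented mapping or is asked to configure the format.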

Schema Flexibility

For formats such as CSV and JSON, where the schema is not embedded in the file, Smile offers three mechanisms for controlling how the data is interpreted:

Automatic inference -- Column types are deduced from the data values.
Explicit schema -- A StructType object defines column names, types, and measures.
Format string -- A compact string like "delimiter=\t,header=true,comment=#" configures parsing behavior.
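
Automatic inference can be sketched as trying progressively wider types over a column's raw string values. The widening order (int, then double, then string), the treatment of empty cells as missing, and the names `TypeInference` and `inferType` are all assumptions made for illustration; Smile's own inference rules may differ and cover more types.

```java
import java.util.List;

// Hypothetical sketch of column type inference; Smile's actual rules may differ.
class TypeInference {
    enum ColumnType { INT, DOUBLE, STRING }

    /** Returns the narrowest type that can represent every non-empty value. */
    static ColumnType inferType(List<String> values) {
        ColumnType type = ColumnType.INT;
        boolean seen = false;
        for (String value : values) {
            String v = value.trim();
            if (v.isEmpty()) {
                continue; // assumption: empty cells are missing values
            }
            seen = true;
            if (type == ColumnType.INT && !parses(v, true)) {
                type = ColumnType.DOUBLE; // widen and re-check below
            }
            if (type == ColumnType.DOUBLE && !parses(v, false)) {
                return ColumnType.STRING; // nothing wider to try
            }
        }
        // A column with no observed values carries no evidence of a numeric type.
        return seen ? type : ColumnType.STRING;
    }

    private static boolean parses(String v, boolean asInt) {
        try {
            if (asInt) {
                Integer.parseInt(v);
            } else {
                Double.parseDouble(v);
            }
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Inference of this kind depends on the values actually seen, so a column that happens to contain only whole numbers in one file may be typed differently in another. Supplying an explicit schema removes that variability.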

Uniform Output

All readers produce a DataFrame -- a two-dimensional, potentially heterogeneous tabular structure backed by typed ValueVector columns. This guarantees that downstream operations (inspection, selection, transformation, numerical conversion) work identically regardless of the source format.
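
The idea of a table backed by typed columns can be illustrated with a miniature stand-in. `MiniTable`, `Column`, `IntColumn`, and `StringColumn` are hypothetical types written for this page; Smile's DataFrame and ValueVector are far richer, and this sketch makes no claim about their actual methods.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical miniature of a column-oriented table; not Smile's DataFrame.
class MiniTable {
    /** A typed column: every implementation reports its length and element type. */
    interface Column {
        int size();
        String typeName();
    }

    static final class IntColumn implements Column {
        final int[] values;
        IntColumn(int[] values) { this.values = values; }
        public int size() { return values.length; }
        public String typeName() { return "int"; }
    }

    static final class StringColumn implements Column {
        final String[] values;
        StringColumn(String[] values) { this.values = values; }
        public int size() { return values.length; }
        public String typeName() { return "String"; }
    }

    private final Map<String, Column> columns = new LinkedHashMap<>();

    /** Adds a column, enforcing the rectangular shape of the table. */
    MiniTable add(String name, Column column) {
        if (!columns.isEmpty() && column.size() != nrow()) {
            throw new IllegalArgumentException("Column " + name + " has wrong length");
        }
        columns.put(name, column);
        return this;
    }

    int nrow() {
        return columns.isEmpty() ? 0 : columns.values().iterator().next().size();
    }

    int ncol() {
        return columns.size();
    }

    /** Downstream code that never needs to know which file format produced the table. */
    String describe() {
        StringBuilder sb = new StringBuilder();
        columns.forEach((name, col) ->
            sb.append(name).append(':').append(col.typeName()).append(' '));
        return sb.toString().trim();
    }
}
```

A method such as `describe` depends only on the table type, so it behaves the same whether the columns were filled from CSV, Parquet, or any other source.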

Relationship to the Data Loading Pipeline

File Data Loading is the first stage of the Smile Data Loading Pipeline. The subsequent stages are:

DataFrame Inspection -- Examine schema, dimensions, and metadata.
Column Selection and Filtering -- Project and filter columns.
Data Transformation -- Normalize, standardize, and scale features.
Numerical Conversion -- Convert to arrays/matrices for ML algorithms.

Related Pages

Implementation:Haifengl_Smile_Read_Data

Knowledge Sources

Smile

Metadata

Property	Value
Domains	Data_Engineering, ETL
Workflow	Data_Loading_Pipeline
Stage	1 of 5
Last Updated	2026-02-08 22:00 GMT
