
Principle:Haifengl Smile File Data Loading

From Leeroopedia


Overview

File Data Loading is the foundational principle of loading tabular data from multiple file formats into a unified in-memory DataFrame structure. In the Smile Java machine learning library, data ingestion is designed to be format-agnostic: a single entry point can detect and parse CSV, JSON, Parquet, Apache Arrow (Feather), Apache Avro, ARFF (Weka), SAS7BDAT, and plain text files. The caller works with a consistent DataFrame regardless of the underlying serialization format.

This principle decouples the what (a rectangular, column-oriented table of typed values) from the how (the byte-level encoding on disk), enabling data scientists and engineers to swap file formats without rewriting downstream analysis code.

Theoretical Basis

The principle is grounded in the ETL (Extract-Transform-Load) paradigm widely used in data engineering:

  1. Extract -- Read raw bytes from a storage system (local file, URI, classpath resource).
  2. Transform -- Parse the format-specific encoding (delimiters, schemas, compression) into typed column vectors.
  3. Load -- Materialize the result as an in-memory DataFrame ready for analysis.

By abstracting the Extract phase behind a unified interface, Smile implements the Adapter pattern from software engineering: each file format has its own reader class (e.g., CSV, JSON, Parquet, Arrow, Arff, SAS, Avro) that implements the same contract -- reading a path and returning a DataFrame.
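The Adapter pattern described above can be sketched in plain Java. The class and record names below are illustrative stand-ins, not Smile's actual types (Smile's real readers live in smile.io and return smile.data.DataFrame); the point is that every format-specific reader implements one contract and returns one type.

```java
import java.nio.file.Path;
import java.util.List;

public class AdapterSketch {
    // Minimal stand-in for a DataFrame: just column names here.
    record DataFrame(List<String> columnNames) {}

    // The shared contract every format-specific reader implements (Adapter pattern).
    interface DataReader {
        DataFrame read(Path path);
    }

    // Two adapters with different parsing logic but the same return type.
    static class CsvReader implements DataReader {
        public DataFrame read(Path path) {
            // Real code would parse delimited text; stubbed for the sketch.
            return new DataFrame(List.of("a", "b"));
        }
    }

    static class ParquetReader implements DataReader {
        public DataFrame read(Path path) {
            // Real code would decode the embedded Parquet schema; stubbed.
            return new DataFrame(List.of("a", "b"));
        }
    }

    public static void main(String[] args) {
        DataReader csv = new CsvReader();
        DataReader parquet = new ParquetReader();
        // The output type (and here, the value) is invariant across formats.
        System.out.println(csv.read(Path.of("x.csv"))
                .equals(parquet.read(Path.of("y.parquet")))); // prints true
    }
}
```

Because callers depend only on the DataReader contract, a new format is supported by adding one adapter class, with no change to downstream code.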

Formally, let F = {f₁, f₂, …, fₖ} be the set of supported file formats, and let Rᵢ : Path → DataFrame be the reader for format fᵢ. The unified reader is:

R(p) = R_{ϕ(p)}(p)

where ϕ(p) is the format detection function that maps a file path to its format based on the file extension (e.g., .csv, .json, .parquet, .feather, .arff, .sas7bdat, .avro).
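The detection function ϕ(p) can be sketched as a lookup from file extension to format name. The table below mirrors the extensions listed in this article; the class and method names are illustrative, not Smile's internal API.

```java
import java.util.Map;

public class FormatDetector {
    // Extension-to-format table, per the supported formats listed above.
    private static final Map<String, String> EXT_TO_FORMAT = Map.of(
            "csv", "csv", "tsv", "csv", "json", "json",
            "parquet", "parquet", "feather", "arrow",
            "arff", "arff", "sas7bdat", "sas", "avro", "avro");

    // ϕ(p): map a path to its format based on the extension.
    static String detect(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) throw new IllegalArgumentException("no extension: " + path);
        String ext = path.substring(dot + 1).toLowerCase();
        String format = EXT_TO_FORMAT.get(ext);
        if (format == null) throw new IllegalArgumentException("unsupported: " + ext);
        return format;
    }

    public static void main(String[] args) {
        System.out.println(FormatDetector.detect("iris.feather")); // prints arrow
    }
}
```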

This ensures that the output type is invariant across formats:

∀ i, j ∈ {1, …, k} : type(Rᵢ(p)) = type(Rⱼ(q)) = DataFrame

Supported Formats

Format | Extension(s) | Schema Source | Notes
CSV / TSV | .csv, .tsv, .dat, .txt | Inferred from data or user-supplied StructType | Uses Apache Commons CSV; configurable delimiter, quote, escape, comment, header
JSON | .json | Inferred or user-supplied | Single-line (JSON Lines) or multi-line mode
Apache Parquet | .parquet | Embedded in file | Columnar storage; efficient for large datasets
Apache Arrow / Feather | .feather | Embedded in file | Cross-language columnar in-memory format
Apache Avro | .avro | External schema file (JSON) | Row-oriented; requires a separate schema
ARFF (Weka) | .arff | Embedded in file header | Common in ML research; includes attribute metadata
SAS | .sas7bdat | Embedded in file | Native format of SAS statistical software

Design Principles

Convention over Configuration

The default Read.data(path) method uses file extension detection to select the appropriate reader. No explicit format specification is required for standard file extensions. When the extension is ambiguous or absent, an optional format parameter overrides the detection.
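The override semantics can be modeled in a few lines of plain Java. This is a sketch of the dispatch behavior described above, not Smile's implementation: an explicit format argument, when present, takes precedence over extension-based detection.

```java
import java.util.Optional;

public class DispatchSketch {
    // Pick the format: an explicit format (convention override) wins;
    // otherwise fall back to the file extension (convention).
    static String chooseFormat(String path, Optional<String> explicitFormat) {
        if (explicitFormat.isPresent()) return explicitFormat.get();
        int dot = path.lastIndexOf('.');
        if (dot < 0) throw new IllegalArgumentException("ambiguous path, pass a format: " + path);
        return path.substring(dot + 1).toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(chooseFormat("train.csv", Optional.empty()));      // prints csv
        System.out.println(chooseFormat("dump.bin", Optional.of("parquet"))); // prints parquet
    }
}
```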

Schema Flexibility

For formats like CSV and JSON where schema is not embedded, Smile supports three modes:

  • Automatic inference -- Column types are deduced from the data values.
  • Explicit schema -- A StructType object defines column names, types, and measures.
  • Format string -- A compact string like "delimiter=\t,header=true,comment=#" configures parsing behavior.
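A format string of the kind shown in the third mode can be decomposed with a simple key/value parser. This sketch is illustrative only, not Smile's parser; it assumes comma-separated `key=value` pairs as in the example above.

```java
import java.util.HashMap;
import java.util.Map;

public class OptionStringSketch {
    // Parse "delimiter=\t,header=true,comment=#" into an option map.
    static Map<String, String> parse(String spec) {
        Map<String, String> options = new HashMap<>();
        for (String pair : spec.split(",")) {
            int eq = pair.indexOf('=');
            options.put(pair.substring(0, eq).trim(), pair.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse("delimiter=\t,header=true,comment=#");
        System.out.println(opts.get("header")); // prints true
    }
}
```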

Uniform Output

All readers produce a DataFrame -- a two-dimensional, potentially heterogeneous tabular structure backed by typed ValueVector columns. This guarantees that downstream operations (inspection, selection, transformation, numerical conversion) work identically regardless of the source format.
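A minimal model of such a column-oriented structure makes the uniformity concrete. The record names below are illustrative stand-ins for Smile's DataFrame and ValueVector: each column carries a name, a type, and a vector of values, and frame-level operations never depend on where the bytes came from.

```java
import java.util.List;

public class ColumnFrameSketch {
    // A typed column: stand-in for Smile's ValueVector.
    record Column(String name, Class<?> type, List<?> values) {}

    // A two-dimensional, potentially heterogeneous frame of typed columns.
    record Frame(List<Column> columns) {
        int nrow() { return columns.isEmpty() ? 0 : columns.get(0).values().size(); }
        int ncol() { return columns.size(); }
    }

    public static void main(String[] args) {
        // Whether these values came from CSV, Parquet, or ARFF,
        // the in-memory shape and operations are identical.
        Frame frame = new Frame(List.of(
                new Column("sepal_length", Double.class, List.of(5.1, 4.9)),
                new Column("species", String.class, List.of("setosa", "setosa"))));
        System.out.println(frame.nrow() + " x " + frame.ncol()); // prints 2 x 2
    }
}
```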

Relationship to the Data Loading Pipeline

File Data Loading is the first stage of the Smile Data Loading Pipeline. The subsequent stages are:

  1. DataFrame Inspection -- Examine schema, dimensions, and metadata.
  2. Column Selection and Filtering -- Project and filter columns.
  3. Data Transformation -- Normalize, standardize, and scale features.
  4. Numerical Conversion -- Convert to arrays/matrices for ML algorithms.

Metadata

Property | Value
Domains | Data_Engineering, ETL
Workflow | Data_Loading_Pipeline
Stage | 1 of 5
Last Updated | 2026-02-08 22:00 GMT
