Implementation:Haifengl Smile Read Data

Overview

The Read interface in Smile provides static methods for loading data from files. Most methods return a DataFrame; Read.libsvm returns a SparseDataset<Integer> and Read.object returns a deserialized Java Object. It is the primary entry point for file-based data ingestion in the Smile library and supports CSV, JSON, Parquet, Arrow (Feather), Avro, ARFF, SAS (SAS7BDAT), and libsvm formats. Read.data(path) selects the reader from the file extension; Read.data(path, format) takes an explicit format specification instead.

API Summary

Method	Return Type	Description
`Read.data(String path)`	`DataFrame`	Auto-detect format from extension and read
`Read.data(String path, String format)`	`DataFrame`	Read with explicit format specification
`Read.csv(String path)`	`DataFrame`	Read CSV with default format
`Read.csv(String path, String format)`	`DataFrame`	Read CSV with format string (e.g., `"delimiter=\t,header=true"`)
`Read.csv(String path, CSVFormat format)`	`DataFrame`	Read CSV with Apache Commons CSVFormat
`Read.csv(String path, CSVFormat format, StructType schema)`	`DataFrame`	Read CSV with explicit schema
`Read.json(String path)`	`DataFrame`	Read JSON (single-line mode)
`Read.json(String path, JSON.Mode mode, StructType schema)`	`DataFrame`	Read JSON with mode and schema
`Read.parquet(String path)`	`DataFrame`	Read Apache Parquet file
`Read.arrow(String path)`	`DataFrame`	Read Apache Arrow / Feather file
`Read.arff(String path)`	`DataFrame`	Read Weka ARFF file
`Read.sas(String path)`	`DataFrame`	Read SAS7BDAT file
`Read.avro(String path, String schema)`	`DataFrame`	Read Apache Avro with schema file path
`Read.object(Path path)`	`Object`	Deserialize a Java object from file
`Read.libsvm(String path)`	`SparseDataset<Integer>`	Read libsvm sparse format

Source Location

Property	Value
File	`base/src/main/java/smile/io/Read.java`
Lines	L44-590
Package	`smile.io`
Repository	github.com/haifengl/smile

Import

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.type.StructType;
import org.apache.commons.csv.CSVFormat;

External Dependencies

Dependency	Usage
Apache Commons CSV	Parsing CSV/TSV files with configurable delimiters, quotes, and headers
Apache Parquet	Reading columnar Parquet files
Apache Arrow	Reading Arrow IPC / Feather files
Apache Avro	Reading Avro serialized files with external schema

Type: API Doc

Signature

public interface Read {
    // Auto-detection by file extension
    static DataFrame data(String path) throws Exception
    static DataFrame data(String path, String format) throws Exception

    // CSV readers
    static DataFrame csv(String path) throws IOException, URISyntaxException
    static DataFrame csv(String path, String format) throws IOException, URISyntaxException
    static DataFrame csv(String path, CSVFormat format) throws IOException, URISyntaxException
    static DataFrame csv(String path, CSVFormat format, StructType schema) throws IOException, URISyntaxException
    static DataFrame csv(Path path) throws IOException
    static DataFrame csv(Path path, CSVFormat format) throws IOException
    static DataFrame csv(Path path, CSVFormat format, StructType schema) throws IOException

    // JSON readers
    static DataFrame json(String path) throws IOException, URISyntaxException
    static DataFrame json(String path, JSON.Mode mode, StructType schema) throws IOException, URISyntaxException
    static DataFrame json(Path path) throws IOException
    static DataFrame json(Path path, JSON.Mode mode, StructType schema) throws IOException

    // Parquet, Arrow, ARFF, SAS, and Avro readers
    static DataFrame parquet(String uri) throws Exception
    static DataFrame parquet(Path path) throws Exception
    static DataFrame arrow(String path) throws IOException, URISyntaxException
    static DataFrame arrow(Path path) throws IOException
    static DataFrame arff(String path) throws IOException, ParseException, URISyntaxException
    static DataFrame arff(Path path) throws IOException, ParseException
    static DataFrame sas(String path) throws IOException, URISyntaxException
    static DataFrame sas(Path path) throws IOException
    static DataFrame avro(String path, String schema) throws IOException, URISyntaxException
    static DataFrame avro(String path, InputStream schema) throws IOException, URISyntaxException
    static DataFrame avro(Path path, InputStream schema) throws IOException
    static DataFrame avro(Path path, Path schema) throws IOException

    // Object deserialization
    static Object object(Path path) throws IOException, ClassNotFoundException

    // Sparse format
    static SparseDataset<Integer> libsvm(String path) throws IOException, URISyntaxException
    static SparseDataset<Integer> libsvm(Path path) throws IOException
    static SparseDataset<Integer> libsvm(BufferedReader reader) throws IOException
}

Inputs and Outputs

Parameter	Type	Description
`path`	`String` or `Path`	File path or URI to the data file
`format`	`String`, `CSVFormat`, or `JSON.Mode`	Optional format specification
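`schema`	`StructType`, `String`, `Path`, or `InputStream`	Data schema: an optional `StructType` (column names and types) for CSV and JSON; a schema file path or stream for Avro
Returns	`DataFrame`, `SparseDataset<Integer>`, or `Object`	`DataFrame` for tabular formats; `SparseDataset<Integer>` for libsvm; `Object` for `Read.object`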

Format Detection Logic

The Read.data() method extracts the file extension and dispatches to the appropriate reader:

// From Read.java -- format detection switch
String ext = path.substring(path.lastIndexOf(".") + 1);
switch (ext) {
    case "dat":
    case "txt":
    case "csv": return csv(path, format);
    case "arff": return arff(path);
    case "json": return json(path, mode, null);
    case "sas7bdat": return sas(path);
    case "avro": return avro(path, format);
    case "parquet": return parquet(path);
    case "feather": return arrow(path);
}
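
The excerpt above shows only the recognized extensions. As a minimal, standard-library-only sketch (not Smile's code), extension-based dispatch with explicit handling for a missing or unrecognized extension might look like the following. The class name FormatDetector, the detect method, and the lower-casing of the extension are assumptions made for illustration; the extension list is taken from the excerpt.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Smile: maps a file extension to a reader name. */
final class FormatDetector {
    private FormatDetector() { }

    static String detect(String path) {
        int dot = path.lastIndexOf('.');
        // Reject paths with no extension or with a trailing dot.
        if (dot < 0 || dot == path.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + path);
        }
        // Assumption: match extensions case-insensitively.
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "dat":
            case "txt":
            case "csv": return "csv";
            case "arff": return "arff";
            case "json": return "json";
            case "sas7bdat": return "sas";
            case "avro": return "avro";
            case "parquet": return "parquet";
            case "feather": return "arrow";
            default:
                throw new IllegalArgumentException("Unsupported extension: " + ext);
        }
    }
}
```

Failing fast on an unknown extension keeps a typo in a file name from silently selecting the wrong reader.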

Usage Examples

Example 1: Auto-detect format and load

import smile.io.Read;
import smile.data.DataFrame;

// Auto-detect: .csv extension -> CSV reader
DataFrame iris = Read.data("data/iris.csv");
System.out.println(iris.schema());
System.out.println(iris);

// Auto-detect: .arff extension -> ARFF reader
DataFrame weather = Read.data("data/weather.arff");

// Auto-detect: .parquet extension -> Parquet reader
DataFrame sales = Read.data("data/sales.parquet");

Example 2: CSV with custom format string

import smile.io.Read;
import smile.data.DataFrame;

// Tab-separated file with header row and comment lines
DataFrame data = Read.csv("data/gene_expression.tsv",
    "delimiter=\t,header=true,comment=#");

System.out.println("Columns: " + String.join(", ", data.names()));
System.out.println("Rows: " + data.nrow());
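
To make the structure of the format string concrete, the sketch below splits a comma-separated list of key=value options into a map using only the standard library. It is an illustration under stated assumptions, not Smile's parser: the class FormatOptions and its parse method are hypothetical, and the real option handling in Smile may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Smile: splits "key=value,key=value" options. */
final class FormatOptions {
    private FormatOptions() { }

    static Map<String, String> parse(String format) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String token : format.split(",")) {
            if (token.isEmpty()) {
                continue; // tolerate empty segments such as "a=1,,b=2"
            }
            String[] kv = token.split("=", 2);
            if (kv.length != 2 || kv[0].trim().isEmpty()) {
                throw new IllegalArgumentException("Invalid option: " + token);
            }
            // The value is deliberately not trimmed so that a tab delimiter survives.
            options.put(kv[0].trim(), kv[1]);
        }
        return options;
    }
}
```

One limitation of this simple split is that a comma cannot itself be used as an option value.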

Example 3: CSV with Apache Commons CSVFormat and explicit schema

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.type.DataTypes;
import smile.data.type.StructField;
import smile.data.type.StructType;
import org.apache.commons.csv.CSVFormat;

// Define explicit schema
StructType schema = new StructType(
    new StructField("sepal_length", DataTypes.DoubleType),
    new StructField("sepal_width", DataTypes.DoubleType),
    new StructField("petal_length", DataTypes.DoubleType),
    new StructField("petal_width", DataTypes.DoubleType),
    new StructField("species", DataTypes.StringType)
);

CSVFormat format = CSVFormat.Builder.create()
    .setHeader()
    .setSkipHeaderRecord(true)
    .build();

DataFrame iris = Read.csv("data/iris.csv", format, schema);
System.out.println(iris.head(5));

Example 4: JSON in multi-line mode

import smile.io.Read;
import smile.io.JSON;
import smile.data.DataFrame;

// Single-line JSON (one JSON object per line)
DataFrame logs = Read.json("data/access_logs.json");

// Multi-line JSON (entire file is one JSON array)
DataFrame config = Read.json("data/config.json",
    JSON.Mode.MULTI_LINE, null);

Example 5: Reading Avro with external schema

import smile.io.Read;
import smile.data.DataFrame;

// Avro file requires a separate schema file
DataFrame events = Read.avro("data/events.avro",
    "data/events.avsc");

System.out.println("Schema: " + events.schema());
System.out.println("Records: " + events.nrow());

Example 6: Loading a libsvm sparse dataset

import smile.io.Read;
import smile.data.SparseDataset;

// libsvm format: <label> <index>:<value> ...
SparseDataset<Integer> dataset = Read.libsvm("data/svmguide1.txt");
System.out.println("Samples: " + dataset.size());
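
For readers unfamiliar with the line layout noted in the comment above, here is a small standard-library sketch that parses a single libsvm line into an integer label and a sparse index-to-value map. The class LibsvmLine is hypothetical and is not how Smile represents the data; indices are kept exactly as written in the file.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Smile: one parsed "<label> <index>:<value> ..." line. */
final class LibsvmLine {
    final int label;
    final Map<Integer, Double> features;

    private LibsvmLine(int label, Map<Integer, Double> features) {
        this.label = label;
        this.features = features;
    }

    static LibsvmLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            throw new IllegalArgumentException("Empty line");
        }
        // Assumption: integer class labels, matching SparseDataset<Integer>.
        int label = Integer.parseInt(tokens[0]);
        Map<Integer, Double> features = new TreeMap<>();
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":", 2);
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected index:value but got " + tokens[i]);
            }
            features.put(Integer.parseInt(pair[0]), Double.parseDouble(pair[1]));
        }
        return new LibsvmLine(label, features);
    }
}
```

Only the non-zero entries appear on each line, which is why a map (or another sparse structure) is a natural fit.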

Implementation Details

The Read interface delegates to format-specific reader classes:

Format	Reader Class	Key Implementation Detail
CSV	`smile.io.CSV`	Wraps Apache Commons CSV; infers types by scanning values
JSON	`smile.io.JSON`	Supports single-line (JSON Lines) and multi-line (JSON array) modes
Parquet	`smile.io.Parquet`	Uses Apache Parquet library; reads columnar data natively
Arrow	`smile.io.Arrow`	Uses Apache Arrow IPC reader; zero-copy when possible
ARFF	`smile.io.Arff`	Parses Weka ARFF header for attribute types; supports nominal, numeric, string, date
SAS	`smile.io.SAS`	Reads SAS7BDAT binary format
Avro	`smile.io.Avro`	Requires external Avro schema (JSON format)

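Every DataFrame reader listed above has both a String (URI/path) overload and a java.nio.file.Path overload. Read.object is listed with a Path overload only, and Read.libsvm additionally accepts a BufferedReader. The String variants support classpath resources and URIs via the internal Input utility class.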
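
Because a String argument may be either a local path or a URI, calling code sometimes needs to tell the two apart before deciding how to log or validate it. The heuristic below is a sketch under stated assumptions and says nothing about how Smile's Input class works internally: LocationKind and looksLikeUri are hypothetical names, and the check only tests for a syntactically valid URI scheme.

```java
/** Hypothetical helper, not part of Smile: guesses whether a location string is a URI. */
final class LocationKind {
    private LocationKind() { }

    static boolean looksLikeUri(String location) {
        int colon = location.indexOf(':');
        // Require a scheme of at least two characters so that
        // Windows drive letters such as "C:" are treated as paths.
        if (colon < 2) {
            return false;
        }
        // RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
        if (!isAsciiLetter(location.charAt(0))) {
            return false;
        }
        for (int i = 1; i < colon; i++) {
            char c = location.charAt(i);
            boolean ok = isAsciiLetter(c) || (c >= '0' && c <= '9')
                    || c == '+' || c == '-' || c == '.';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

This is only a syntactic guess: a relative file name that happens to contain a colon after a scheme-like prefix would also be classified as a URI.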

Metadata

Property	Value
Type	API Doc
Language	Java
Library Version	5.2.0
Last Updated	2026-02-08 22:00 GMT
