Implementation:Haifengl Smile DataFrame Inspection API

Overview

The DataFrame Inspection API consists of methods on the DataFrame record class that expose the schema, dimensions, column metadata, and descriptive statistics of a loaded dataset. These methods are non-mutating: they read structural information without modifying the underlying data.

The DataFrame in Smile is declared as a Java record with three components: schema (a StructType), columns (a List<ValueVector>), and index (an optional RowIndex).

API Summary

Method	Return Type	Description
`schema()`	`StructType`	Returns the full schema (column names, types, measures)
`names()`	`String[]`	Returns an array of column names
`dtypes()`	`DataType[]`	Returns an array of column data types
`measures()`	`Measure[]`	Returns the measurement level of each column
`size()`	`int`	Returns the number of rows (alias for `nrow()`)
`nrow()`	`int`	Returns the number of rows
`ncol()`	`int`	Returns the number of columns
`shape(int dim)`	`int`	Returns the size of the given dimension (0=rows, 1=columns)
`isEmpty()`	`boolean`	Returns true if the DataFrame has zero rows
`describe()`	`DataFrame`	Returns descriptive statistics for all columns
`head(int numRows)`	`String`	Returns a string representation of the first N rows
`tail(int numRows)`	`String`	Returns a string representation of the last N rows
`toString()`	`String`	Returns a string representation of the first 10 rows

Source Location

Property	Value
File	`base/src/main/java/smile/data/DataFrame.java`
Lines	L49-1047 (inspection-related methods)
Package	`smile.data`
Repository	github.com/haifengl/smile

Import

import smile.data.DataFrame;
import smile.data.type.StructType;
import smile.data.type.DataType;
import smile.data.measure.Measure;

Type: API Doc

Signature

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    // Schema inspection
    public String[] names()
    public DataType[] dtypes()
    public Measure[] measures()

    // Dimension inspection
    public int size()
    public int nrow()
    public int ncol()
    public int shape(int dim)
    public boolean isEmpty()

    // Descriptive statistics
    public DataFrame describe()

    // Display
    public String head(int numRows)
    public String tail(int numRows)
    public String toString(int from, int to, boolean truncate)
}

Inputs and Outputs

Method	Input	Output	Notes
`names()`	(none)	`String[]`	Delegates to `schema.names()`
`dtypes()`	(none)	`DataType[]`	Delegates to `schema.dtypes()`
`measures()`	(none)	`Measure[]`	Delegates to `schema.measures()`
`shape(dim)`	`int dim` (0 or 1)	`int`	dim=0 for rows, dim=1 for columns
`describe()`	(none)	`DataFrame`	Returns a DataFrame with columns: column, type, measure, count, mode, mean, std, min, 25%, 50%, 75%, max
`head(n)`	`int numRows`	`String`	Pretty-printed table of the first N rows
`tail(n)`	`int numRows`	`String`	Pretty-printed table of the last N rows

The describe() Method

The describe() method computes comprehensive summary statistics and returns them as a new DataFrame. The output DataFrame has the following columns:

Column	Type	Description
`column`	String	Column name
`type`	DataType	Data type of the column
`measure`	Measure	Measurement level (nominal, ordinal, etc.)
`count`	int	Number of non-null values
`mode`	Object	Most frequent value (for categorical); NaN for continuous
`mean`	double	Arithmetic mean (numeric columns only)
`std`	double	Sample standard deviation (numeric columns only)
`min`	double	Minimum value
`25%`	double	First quartile (Q1)
`50%`	double	Median (Q2)
`75%`	double	Third quartile (Q3)
`max`	double	Maximum value

For non-numeric columns (strings, objects), only count and mode are computed; the remaining statistics are NaN.
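
To make the numeric columns of the describe() output concrete, here is a minimal sketch that computes the same kinds of statistics for a single double[] column using only the JDK. It is not Smile's implementation: the class DescribeSketch, the record Summary, and the helper quantile are hypothetical names, the quartiles use linear interpolation between order statistics (an assumed convention that may differ from the one Smile uses), and NaN entries are treated as missing values.

```java
import java.util.Arrays;

public class DescribeSketch {
    // Hypothetical holder whose fields mirror the numeric describe() columns.
    record Summary(int count, double mean, double std,
                   double min, double q1, double median, double q3, double max) {}

    // Quantile by linear interpolation between order statistics.
    // This convention is an assumption; other definitions give slightly different values.
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    static Summary summarize(double[] values) {
        // Drop NaN entries (treated here as missing) and sort the rest.
        double[] x = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .sorted()
                .toArray();
        int n = x.length;
        double nan = Double.NaN;
        if (n == 0) {
            return new Summary(0, nan, nan, nan, nan, nan, nan, nan);
        }

        double mean = Arrays.stream(x).average().orElse(nan);

        // Sum of squared deviations from the mean.
        double ss = 0.0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        // Sample standard deviation divides by n - 1; undefined for a single value.
        double std = n > 1 ? Math.sqrt(ss / (n - 1)) : nan;

        return new Summary(n, mean, std, x[0],
                quantile(x, 0.25), quantile(x, 0.50), quantile(x, 0.75), x[n - 1]);
    }

    public static void main(String[] args) {
        // One NaN entry is included to show that count covers non-missing values only.
        System.out.println(summarize(new double[] {1, 2, 3, 4, 5, Double.NaN}));
    }
}
```

The count field here plays the role of the count column: it covers only the values that are actually present, which is why the NaN entry in the sample input is not counted.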

Usage Examples

Example 1: Basic schema and dimension inspection

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Column names
String[] names = iris.names();
// -> ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

// Data types
var dtypes = iris.dtypes();
for (int i = 0; i < names.length; i++) {
    System.out.printf("%s: %s%n", names[i], dtypes[i]);
}
// -> sepal_length: double
// -> sepal_width: double
// -> petal_length: double
// -> petal_width: double
// -> species: String

// Dimensions
System.out.println("Rows: " + iris.nrow());     // 150
System.out.println("Columns: " + iris.ncol());   // 5
System.out.println("Shape[0]: " + iris.shape(0)); // 150
System.out.println("Shape[1]: " + iris.shape(1)); // 5
System.out.println("Empty: " + iris.isEmpty());   // false

Example 2: Descriptive statistics

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Compute and display descriptive statistics
DataFrame stats = iris.describe();
System.out.println(stats);
// Outputs a table with column, type, measure, count, mode,
// mean, std, min, 25%, 50%, 75%, max for each column

Example 3: Previewing data with head() and tail()

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.data("data/housing.csv");

// Preview first 5 rows
System.out.println(data.head(5));

// Preview last 3 rows
System.out.println(data.tail(3));

// Default toString shows first 10 rows
System.out.println(data);

Example 4: Measurement level inspection

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.measure.Measure;

DataFrame data = Read.arff("data/weather.arff");

// ARFF files embed measurement metadata
Measure[] measures = data.measures();
String[] names = data.names();
for (int i = 0; i < names.length; i++) {
    System.out.printf("%s: %s%n", names[i],
        measures[i] != null ? measures[i] : "numeric");
}

Example 5: Conditional logic based on inspection

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.type.DataType;

DataFrame data = Read.data("data/mixed_types.csv");

// Find all numeric columns for downstream analysis
var dtypes = data.dtypes();
var names = data.names();
var numericCols = new java.util.ArrayList<String>();

for (int i = 0; i < names.length; i++) {
    if (dtypes[i].isFloating() || dtypes[i].isIntegral()) {
        numericCols.add(names[i]);
    }
}

System.out.println("Numeric columns: " + numericCols);
// Use these columns for feature extraction
DataFrame numeric = data.select(numericCols.toArray(new String[0]));

Implementation Details

The DataFrame is a Java record, so schema(), columns(), and index() are automatically generated accessor methods for the record components. The inspection methods delegate to these components:

names() delegates to schema.names()
dtypes() delegates to schema.dtypes()
measures() delegates to schema.measures()
size() and nrow() return columns.getFirst().size(), the length of the first column vector
ncol() returns columns.size(), the number of column vectors
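
The delegation pattern listed above can be sketched with a simplified stand-in record. MiniSchema and MiniFrame are hypothetical types written for illustration only; they are not Smile classes, columns are plain double[] arrays rather than ValueVector instances, and the exception thrown for an invalid dim is an assumption.

```java
import java.util.List;

public class RecordDelegationSketch {
    // Hypothetical stand-in for StructType: holds only the column names.
    record MiniSchema(String[] names) {}

    // Hypothetical stand-in for DataFrame: each column is a plain double[].
    record MiniFrame(MiniSchema schema, List<double[]> columns) {
        // Delegates to the schema component.
        String[] names() {
            return schema.names();
        }

        // Row count is the length of the first column vector.
        int nrow() {
            return columns.get(0).length;
        }

        // size() is an alias for nrow().
        int size() {
            return nrow();
        }

        // Column count is the number of column vectors.
        int ncol() {
            return columns.size();
        }

        // dim = 0 selects rows, dim = 1 selects columns.
        // Rejecting other values with an exception is an assumption of this sketch.
        int shape(int dim) {
            return switch (dim) {
                case 0 -> nrow();
                case 1 -> ncol();
                default -> throw new IllegalArgumentException("Invalid dim: " + dim);
            };
        }
    }

    // Builds a small two-column, three-row frame for demonstration.
    static MiniFrame sample() {
        return new MiniFrame(
                new MiniSchema(new String[] {"x", "y"}),
                List.of(new double[] {1.0, 2.0, 3.0}, new double[] {4.0, 5.0, 6.0}));
    }

    public static void main(String[] args) {
        MiniFrame df = sample();
        System.out.println(String.join(",", df.names())
                + " rows=" + df.shape(0) + " cols=" + df.shape(1));
    }
}
```

Because the record generates schema() and columns() automatically, the hand-written methods stay one-liners that read from those components without copying or mutating anything.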

The describe() method iterates over all columns and computes the statistics appropriate to each column's data type. For categorical columns, the mode is found by frequency counting; for numeric columns, the mean, standard deviation, and quartiles are computed with the MathEx utility class.
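
Finding a mode by frequency counting can be illustrated with plain JDK collections. The sketch below is an assumption-laden illustration, not the code in DataFrame.java: ModeSketch and mode are hypothetical names, null entries are skipped as missing, and ties are resolved in favor of the value seen first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class ModeSketch {
    // Returns the most frequent non-null value, or empty if there is none.
    static <T> Optional<T> mode(List<T> values) {
        // LinkedHashMap keeps first-seen order, so ties go to the earliest value.
        Map<T, Integer> counts = new LinkedHashMap<>();
        for (T v : values) {
            if (v != null) {
                counts.merge(v, 1, Integer::sum);
            }
        }

        T best = null;
        int bestCount = 0;
        for (Map.Entry<T, Integer> e : counts.entrySet()) {
            // Strictly greater, so an earlier value wins a tie.
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Arrays.asList is used because it permits null elements.
        List<String> species = Arrays.asList("setosa", null, "virginica", "setosa");
        System.out.println(mode(species).orElse("(none)"));
    }
}
```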

Related Pages

Principle:Haifengl_Smile_DataFrame_Inspection

Metadata

Property	Value
Type	API Doc
Language	Java
Library Version	5.2.0
Last Updated	2026-02-08 22:00 GMT
