Implementation:Haifengl Smile DataFrame Inspection API

From Leeroopedia


Overview

The DataFrame Inspection API consists of methods on the DataFrame record class that expose the schema, dimensions, column metadata, and descriptive statistics of a loaded dataset. These methods are non-mutating: they read structural information and compute summaries without modifying the underlying data.

The DataFrame in Smile is declared as a Java record with three components: schema (a StructType), columns (a List<ValueVector>), and index (an optional RowIndex).

API Summary

| Method | Return Type | Description |
| --- | --- | --- |
| schema() | StructType | Returns the full schema (column names, types, measures) |
| names() | String[] | Returns an array of column names |
| dtypes() | DataType[] | Returns an array of column data types |
| measures() | Measure[] | Returns the measurement level of each column |
| size() | int | Returns the number of rows (alias for nrow()) |
| nrow() | int | Returns the number of rows |
| ncol() | int | Returns the number of columns |
| shape(int dim) | int | Returns the size of the given dimension (0=rows, 1=columns) |
| isEmpty() | boolean | Returns true if the DataFrame has zero rows |
| describe() | DataFrame | Returns descriptive statistics for all columns |
| head(int numRows) | String | Returns a string representation of the first N rows |
| tail(int numRows) | String | Returns a string representation of the last N rows |
| toString() | String | Returns a string representation of the first 10 rows |

Source Location

| Property | Value |
| --- | --- |
| File | base/src/main/java/smile/data/DataFrame.java |
| Lines | L49-1047 (inspection-related methods) |
| Package | smile.data |
| Repository | github.com/haifengl/smile |

Import

import smile.data.DataFrame;
import smile.data.type.StructType;
import smile.data.type.DataType;
import smile.data.measure.Measure;

Type: API Doc

Signature

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    // Schema inspection
    public String[] names()
    public DataType[] dtypes()
    public Measure[] measures()

    // Dimension inspection
    public int size()
    public int nrow()
    public int ncol()
    public int shape(int dim)
    public boolean isEmpty()

    // Descriptive statistics
    public DataFrame describe()

    // Display
    public String head(int numRows)
    public String tail(int numRows)
    public String toString(int from, int to, boolean truncate)
}

Inputs and Outputs

| Method | Input | Output | Notes |
| --- | --- | --- | --- |
| names() | (none) | String[] | Delegates to schema.names() |
| dtypes() | (none) | DataType[] | Delegates to schema.dtypes() |
| measures() | (none) | Measure[] | Delegates to schema.measures() |
| shape(dim) | int dim (0 or 1) | int | dim=0 for rows, dim=1 for columns |
| describe() | (none) | DataFrame | Returns a DataFrame with columns: column, type, measure, count, mode, mean, std, min, 25%, 50%, 75%, max |
| head(n) | int numRows | String | Pretty-printed table of first N rows |
| tail(n) | int numRows | String | Pretty-printed table of last N rows |

The describe() Method

The describe() method computes comprehensive summary statistics and returns them as a new DataFrame. The output DataFrame has the following columns:

| Column | Type | Description |
| --- | --- | --- |
| column | String | Column name |
| type | DataType | Data type of the column |
| measure | Measure | Measurement level (nominal, ordinal, etc.) |
| count | int | Number of non-null values |
| mode | Object | Most frequent value (for categorical); NaN for continuous |
| mean | double | Arithmetic mean (numeric columns only) |
| std | double | Sample standard deviation (numeric columns only) |
| min | double | Minimum value |
| 25% | double | First quartile (Q1) |
| 50% | double | Median (Q2) |
| 75% | double | Third quartile (Q3) |
| max | double | Maximum value |

For non-numeric columns (strings, objects), only count and mode are computed; the remaining statistics are NaN.
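
Because missing statistics are reported as NaN, code that consumes the summary has to test for NaN explicitly; in Java, `x == Double.NaN` is always false. The following is a minimal sketch in plain Java rather than Smile's API: `ColumnSummary` is a hypothetical stand-in for one row of a describe()-style result, and the sample values are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one row of a describe()-style summary (not a Smile class).
record ColumnSummary(String column, int count, double mean) {}

class NaNFilterSketch {
    // Returns the names of the columns whose mean was actually computed.
    static List<String> columnsWithMean(List<ColumnSummary> rows) {
        List<String> result = new ArrayList<>();
        for (ColumnSummary row : rows) {
            // Double.isNaN is required: NaN never compares equal to anything, itself included.
            if (!Double.isNaN(row.mean())) {
                result.add(row.column());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<ColumnSummary> rows = List.of(
            new ColumnSummary("height", 4, 1.75),        // numeric column (illustrative value)
            new ColumnSummary("label", 4, Double.NaN));  // string column: no mean available
        System.out.println(columnsWithMean(rows)); // prints [height]
    }
}
```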

Usage Examples

Example 1: Basic schema and dimension inspection

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Column names
String[] names = iris.names();
// -> ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

// Data types
var dtypes = iris.dtypes();
for (int i = 0; i < names.length; i++) {
    System.out.printf("%s: %s%n", names[i], dtypes[i]);
}
// -> sepal_length: double
// -> sepal_width: double
// -> petal_length: double
// -> petal_width: double
// -> species: String

// Dimensions
System.out.println("Rows: " + iris.nrow());     // 150
System.out.println("Columns: " + iris.ncol());   // 5
System.out.println("Shape[0]: " + iris.shape(0)); // 150
System.out.println("Shape[1]: " + iris.shape(1)); // 5
System.out.println("Empty: " + iris.isEmpty());   // false

Example 2: Descriptive statistics

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Compute and display descriptive statistics
DataFrame stats = iris.describe();
System.out.println(stats);
// Outputs a table with column, type, measure, count, mode,
// mean, std, min, 25%, 50%, 75%, max for each column

Example 3: Previewing data with head() and tail()

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.data("data/housing.csv");

// Preview first 5 rows
System.out.println(data.head(5));

// Preview last 3 rows
System.out.println(data.tail(3));

// Default toString shows first 10 rows
System.out.println(data);

Example 4: Measurement level inspection

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.measure.Measure;

DataFrame data = Read.arff("data/weather.arff");

// ARFF files embed measurement metadata
Measure[] measures = data.measures();
String[] names = data.names();
for (int i = 0; i < names.length; i++) {
    // A null entry means no measure is attached to the column
    System.out.printf("%s: %s%n", names[i],
        measures[i] != null ? measures[i] : "(none)");
}

Example 5: Conditional logic based on inspection

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.type.DataType;

DataFrame data = Read.data("data/mixed_types.csv");

// Find all numeric columns for downstream analysis
var dtypes = data.dtypes();
var names = data.names();
var numericCols = new java.util.ArrayList<String>();

for (int i = 0; i < names.length; i++) {
    if (dtypes[i].isFloating() || dtypes[i].isIntegral()) {
        numericCols.add(names[i]);
    }
}

System.out.println("Numeric columns: " + numericCols);
// Use these columns for feature extraction
DataFrame numeric = data.select(numericCols.toArray(new String[0]));

Implementation Details

The DataFrame is a Java record, so schema(), columns(), and index() are automatically generated accessor methods for the record components. The inspection methods delegate to these components:

  • names() delegates to schema.names()
  • dtypes() delegates to schema.dtypes()
  • measures() delegates to schema.measures()
  • size() and nrow() return columns.getFirst().size() -- the length of the first column vector
  • ncol() returns columns.size() -- the number of column vectors
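
The delegation pattern listed above can be sketched with a small self-contained record. This is an illustration under stated assumptions, not Smile's source: `MiniSchema`, `MiniColumn`, and `MiniFrame` are hypothetical stand-ins for StructType, ValueVector, and DataFrame, the sketch uses `get(0)` where the text mentions `getFirst()`, and throwing on an invalid `dim` is a choice made for this example.

```java
import java.util.List;

// Hypothetical stand-ins for StructType and ValueVector (not the real Smile classes).
record MiniSchema(String[] names, String[] dtypes) {}

record MiniColumn(double[] values) {
    int size() { return values.length; }
}

// A record gets schema() and columns() accessors generated automatically.
record MiniFrame(MiniSchema schema, List<MiniColumn> columns) {
    String[] names()  { return schema.names(); }      // delegates to the schema
    String[] dtypes() { return schema.dtypes(); }     // delegates to the schema
    int nrow() { return columns.get(0).size(); }      // length of the first column
    int size() { return nrow(); }                     // alias for nrow()
    int ncol() { return columns.size(); }             // number of column vectors
    boolean isEmpty() { return nrow() == 0; }

    int shape(int dim) {
        return switch (dim) {
            case 0 -> nrow();
            case 1 -> ncol();
            // Assumption for this sketch: reject any other dimension.
            default -> throw new IllegalArgumentException("Invalid dim: " + dim);
        };
    }
}

class MiniFrameDemo {
    static MiniFrame sample() {
        MiniSchema schema = new MiniSchema(
            new String[] {"x", "y"}, new String[] {"double", "double"});
        List<MiniColumn> columns = List.of(
            new MiniColumn(new double[] {1.0, 2.0, 3.0}),
            new MiniColumn(new double[] {4.0, 5.0, 6.0}));
        return new MiniFrame(schema, columns);
    }

    public static void main(String[] args) {
        MiniFrame df = sample();
        System.out.println(df.nrow() + " x " + df.ncol());   // prints 3 x 2
        System.out.println(String.join(", ", df.names()));   // prints x, y
    }
}
```

Deriving the row count from the first column keeps the record free of redundant state, at the cost of assuming every column has the same length and that at least one column exists.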

The describe() method iterates over all columns, computing statistics appropriate for each column's data type. Categorical columns compute mode via frequency counting; numeric columns compute mean, standard deviation, and quartiles using the MathEx utility class.
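
As a rough illustration of the two kinds of computation described above, the sketch below counts frequencies to find a mode and computes a mean and sample standard deviation by hand. It is plain Java written for this page and does not call MathEx or reproduce Smile's implementation; the class name `DescribeSketch`, the tie-breaking rule, and the handling of short inputs are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class DescribeSketch {
    // Mode via frequency counting; nulls are skipped, ties go to the value that reaches the top count first.
    static String mode(String[] values) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String v : values) {
            if (v == null) continue;
            int c = counts.merge(v, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = v;
            }
        }
        return best;
    }

    // Arithmetic mean; NaN for an empty array.
    static double mean(double[] x) {
        if (x.length == 0) return Double.NaN;
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Sample standard deviation (n - 1 denominator); NaN when fewer than two values.
    static double std(double[] x) {
        if (x.length < 2) return Double.NaN;
        double mu = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / (x.length - 1));
    }

    public static void main(String[] args) {
        String[] species = {"setosa", "virginica", "setosa"};
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0};
        System.out.println(mode(species)); // prints setosa
        System.out.println(mean(x));       // prints 3.0
        System.out.println(std(x));        // about 1.5811
    }
}
```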

Metadata

| Property | Value |
| --- | --- |
| Type | API Doc |
| Language | Java |
| Library Version | 5.2.0 |
| Last Updated | 2026-02-08 22:00 GMT |
