Implementation:Haifengl Smile DataFrame Inspection API
Overview
The DataFrame Inspection API consists of methods on the DataFrame record class that expose the schema, dimensions, column metadata, and descriptive statistics of a loaded dataset. These methods are non-mutating: they read structural information without modifying the underlying data.
The DataFrame in Smile is declared as a Java record with three components: schema (a StructType), columns (a List<ValueVector>), and index (an optional RowIndex).
API Summary
| Method | Return Type | Description |
|---|---|---|
| schema() | StructType | Returns the full schema (column names, types, measures) |
| names() | String[] | Returns an array of column names |
| dtypes() | DataType[] | Returns an array of column data types |
| measures() | Measure[] | Returns the measurement level of each column |
| size() | int | Returns the number of rows (alias for nrow()) |
| nrow() | int | Returns the number of rows |
| ncol() | int | Returns the number of columns |
| shape(int dim) | int | Returns size of given dimension (0=rows, 1=columns) |
| isEmpty() | boolean | Returns true if the DataFrame has zero rows |
| describe() | DataFrame | Returns descriptive statistics for all columns |
| head(int numRows) | String | Returns string representation of first N rows |
| tail(int numRows) | String | Returns string representation of last N rows |
| toString() | String | Returns string representation of first 10 rows |
Source Location
| Property | Value |
|---|---|
| File | base/src/main/java/smile/data/DataFrame.java |
| Lines | L49-1047 (inspection-related methods) |
| Package | smile.data |
| Repository | github.com/haifengl/smile |
Import
import smile.data.DataFrame;
import smile.data.type.StructType;
import smile.data.type.DataType;
import smile.data.measure.Measure;
Type: API Doc
Signature
public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {
    // Schema inspection
    public String[] names()
    public DataType[] dtypes()
    public Measure[] measures()
    // Dimension inspection
    public int size()
    public int nrow()
    public int ncol()
    public int shape(int dim)
    public boolean isEmpty()
    // Descriptive statistics
    public DataFrame describe()
    // Display
    public String head(int numRows)
    public String tail(int numRows)
    public String toString(int from, int to, boolean truncate)
}
Inputs and Outputs
| Method | Input | Output | Notes |
|---|---|---|---|
| names() | (none) | String[] | Delegates to schema.names() |
| dtypes() | (none) | DataType[] | Delegates to schema.dtypes() |
| measures() | (none) | Measure[] | Delegates to schema.measures() |
| shape(dim) | int dim (0 or 1) | int | dim=0 for rows, dim=1 for columns |
| describe() | (none) | DataFrame | Returns a DataFrame with columns: column, type, measure, count, mode, mean, std, min, 25%, 50%, 75%, max |
| head(n) | int numRows | String | Pretty-printed table of first N rows |
| tail(n) | int numRows | String | Pretty-printed table of last N rows |
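To make the head(n) and tail(n) row ranges concrete, the following is a minimal sketch using only the Java standard library. PreviewSketch and its methods are hypothetical names, not part of Smile, and the clamping behavior for out-of-range n is an assumption made for illustration; the library may handle such inputs differently.

```java
import java.util.List;

final class PreviewSketch {
    /** First n rows, i.e. the half-open range [0, min(n, size)). */
    static <T> List<T> head(List<T> rows, int n) {
        int to = Math.max(0, Math.min(n, rows.size()));
        return rows.subList(0, to);
    }

    /** Last n rows, i.e. the half-open range [max(0, size - n), size). */
    static <T> List<T> tail(List<T> rows, int n) {
        int from = Math.max(0, rows.size() - Math.max(0, n));
        return rows.subList(from, rows.size());
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(1, 2, 3, 4, 5);
        System.out.println(head(rows, 2)); // [1, 2]
        System.out.println(tail(rows, 3)); // [3, 4, 5]
    }
}
```

Clamping keeps the preview safe to call on small datasets, where a request for more rows than exist returns everything available.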
The describe() Method
The describe() method computes comprehensive summary statistics and returns them as a new DataFrame. The output DataFrame has the following columns:
| Column | Type | Description |
|---|---|---|
| column | String | Column name |
| type | DataType | Data type of the column |
| measure | Measure | Measurement level (nominal, ordinal, etc.) |
| count | int | Number of non-null values |
| mode | Object | Most frequent value (for categorical); NaN for continuous |
| mean | double | Arithmetic mean (numeric columns only) |
| std | double | Sample standard deviation (numeric columns only) |
| min | double | Minimum value |
| 25% | double | First quartile (Q1) |
| 50% | double | Median (Q2) |
| 75% | double | Third quartile (Q3) |
| max | double | Maximum value |
For non-numeric columns (strings, objects), only count and mode are computed; the remaining statistics are NaN.
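For readers who want to cross-check the numeric columns of the describe() output by hand, here is a minimal sketch in plain Java. DescribeSketch is a hypothetical helper, not Smile code. The standard deviation uses the n - 1 (sample) denominator named in the table above; the quantile rule (linear interpolation between order statistics) is an assumption, and the library may use a different convention, so quartiles can differ slightly on small samples.

```java
import java.util.Arrays;

final class DescribeSketch {
    /** Arithmetic mean; NaN for an empty array. */
    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Sample standard deviation (n - 1 denominator). */
    static double std(double[] x) {
        double mu = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / (x.length - 1));
    }

    /** Quantile for p in [0, 1] by linear interpolation (assumed convention). */
    static double quantile(double[] x, double p) {
        double[] sorted = x.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        double[] x = {5.0, 1.0, 4.0, 2.0, 3.0};
        System.out.println(mean(x));            // 3.0
        System.out.println(quantile(x, 0.25));  // 2.0
        System.out.println(quantile(x, 0.50));  // 3.0
        System.out.println(quantile(x, 0.75));  // 4.0
    }
}
```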
Usage Examples
Example 1: Basic schema and dimension inspection
import smile.io.Read;
import smile.data.DataFrame;
DataFrame iris = Read.csv("data/iris.csv",
        "delimiter=,,header=true");
// Column names
String[] names = iris.names();
// -> ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
// Data types
var dtypes = iris.dtypes();
for (int i = 0; i < names.length; i++) {
    System.out.printf("%s: %s%n", names[i], dtypes[i]);
}
// -> sepal_length: double
// -> sepal_width: double
// -> petal_length: double
// -> petal_width: double
// -> species: String
// Dimensions
System.out.println("Rows: " + iris.nrow()); // 150
System.out.println("Columns: " + iris.ncol()); // 5
System.out.println("Shape[0]: " + iris.shape(0)); // 150
System.out.println("Shape[1]: " + iris.shape(1)); // 5
System.out.println("Empty: " + iris.isEmpty()); // false
Example 2: Descriptive statistics
import smile.io.Read;
import smile.data.DataFrame;
DataFrame iris = Read.csv("data/iris.csv",
        "delimiter=,,header=true");
// Compute and display descriptive statistics
DataFrame stats = iris.describe();
System.out.println(stats);
// Outputs a table with column, type, measure, count, mode,
// mean, std, min, 25%, 50%, 75%, max for each column
Example 3: Previewing data with head() and tail()
import smile.io.Read;
import smile.data.DataFrame;
DataFrame data = Read.data("data/housing.csv");
// Preview first 5 rows
System.out.println(data.head(5));
// Preview last 3 rows
System.out.println(data.tail(3));
// Default toString shows first 10 rows
System.out.println(data);
Example 4: Measurement level inspection
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.measure.Measure;
DataFrame data = Read.arff("data/weather.arff");
// ARFF files embed measurement metadata
Measure[] measures = data.measures();
String[] names = data.names();
for (int i = 0; i < names.length; i++) {
    System.out.printf("%s: %s%n", names[i],
            measures[i] != null ? measures[i] : "numeric");
}
Example 5: Conditional logic based on inspection
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.type.DataType;
DataFrame data = Read.data("data/mixed_types.csv");
// Find all numeric columns for downstream analysis
var dtypes = data.dtypes();
var names = data.names();
var numericCols = new java.util.ArrayList<String>();
for (int i = 0; i < names.length; i++) {
    if (dtypes[i].isFloating() || dtypes[i].isIntegral()) {
        numericCols.add(names[i]);
    }
}
System.out.println("Numeric columns: " + numericCols);
// Use these columns for feature extraction
DataFrame numeric = data.select(numericCols.toArray(new String[0]));
Implementation Details
The DataFrame is a Java record, so schema(), columns(), and index() are automatically generated accessor methods for the record components. The inspection methods delegate to these components:
- names() delegates to schema.names()
- dtypes() delegates to schema.dtypes()
- measures() delegates to schema.measures()
- size() and nrow() return columns.getFirst().size(), the length of the first column vector
- ncol() returns columns.size(), the number of column vectors
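The delegation pattern listed above can be sketched with a small stand-in record. MiniSchema, MiniColumn, and MiniFrame are hypothetical types written for illustration only; they are not Smile's StructType, ValueVector, or DataFrame. The exception thrown by shape(dim) for an invalid dimension is also an assumption.

```java
import java.util.List;

final class MiniFrameSketch {
    // Hypothetical stand-in for StructType: parallel arrays of names and type labels.
    record MiniSchema(String[] names, String[] dtypes) {}

    // Hypothetical stand-in for ValueVector: one named column of values.
    record MiniColumn(String name, double[] values) {
        int size() { return values.length; }
    }

    // Record components become the accessors schema() and columns() automatically.
    record MiniFrame(MiniSchema schema, List<MiniColumn> columns) {
        String[] names()  { return schema.names(); }     // delegate to the schema
        String[] dtypes() { return schema.dtypes(); }
        int nrow() { return columns.get(0).size(); }     // length of the first column
        int size() { return nrow(); }                    // alias for nrow()
        int ncol() { return columns.size(); }            // number of column vectors
        boolean isEmpty() { return nrow() == 0; }

        int shape(int dim) {
            return switch (dim) {
                case 0 -> nrow();
                case 1 -> ncol();
                default -> throw new IllegalArgumentException("Invalid dim: " + dim);
            };
        }
    }

    // Builds a 3-row, 2-column frame for the demo.
    static MiniFrame demo() {
        var schema = new MiniSchema(new String[]{"x", "y"}, new String[]{"double", "double"});
        var columns = List.of(
                new MiniColumn("x", new double[]{1.0, 2.0, 3.0}),
                new MiniColumn("y", new double[]{4.0, 5.0, 6.0}));
        return new MiniFrame(schema, columns);
    }

    public static void main(String[] args) {
        MiniFrame df = demo();
        System.out.println(String.join(",", df.names()));      // x,y
        System.out.println(df.shape(0) + " x " + df.shape(1)); // 3 x 2
    }
}
```

Because the row count is read from the first column, this sketch, like any design of this shape, assumes every column vector has the same length and that at least one column exists.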
The describe() method iterates over all columns, computing statistics appropriate for each column's data type. Categorical columns compute mode via frequency counting; numeric columns compute mean, standard deviation, and quartiles using the MathEx utility class.
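Frequency counting for the mode can be illustrated with a short standard-library sketch. ModeSketch is a hypothetical helper and not the library's code; skipping nulls and resolving ties in favor of the value that first reaches the highest count are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class ModeSketch {
    /** Most frequent non-null value, or null if there is none. */
    static <T> T mode(T[] values) {
        Map<T, Integer> counts = new HashMap<>();
        T best = null;
        int bestCount = 0;
        for (T v : values) {
            if (v == null) continue;                   // treat null as missing
            int c = counts.merge(v, 1, Integer::sum);  // increment this value's count
            if (c > bestCount) {                       // strict '>' keeps the earlier leader on ties
                bestCount = c;
                best = v;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] species = {"setosa", "virginica", "virginica", null, "setosa", "virginica"};
        System.out.println(mode(species)); // virginica
    }
}
```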
Metadata
| Property | Value |
|---|---|
| Type | API Doc |
| Language | Java |
| Library Version | 5.2.0 |
| Last Updated | 2026-02-08 22:00 GMT |