Implementation:Haifengl Smile DataFrame Column Operations

Overview

The DataFrame Column Operations API provides methods on the DataFrame record for selecting, dropping, merging, concatenating, joining, adding, and factorizing columns. Most of these methods return a new DataFrame and leave the receiver untouched; add() and set() modify the DataFrame in place and return it. Either way, the calls can be chained into composable data preparation pipelines.

API Summary

Method	Return Type	Description
`select(String... names)`	`DataFrame`	Returns a new DataFrame with only the named columns
`select(int... indices)`	`DataFrame`	Returns a new DataFrame with columns at the given indices
`drop(String... names)`	`DataFrame`	Returns a new DataFrame without the named columns
`drop(int... indices)`	`DataFrame`	Returns a new DataFrame without columns at the given indices
`merge(DataFrame... dataframes)`	`DataFrame`	Merges columns horizontally; renames duplicate column names with a numeric suffix (_2, _3, etc.)
`concat(DataFrame... dataframes)`	`DataFrame`	Concatenates rows vertically; requires identical schemas
`join(DataFrame other)`	`DataFrame`	Inner join on row index; falls back to merge if either DataFrame has no index
`add(ValueVector... vectors)`	`DataFrame`	Appends new columns (mutates this DataFrame)
`set(String name, ValueVector column)`	`DataFrame`	Sets/replaces a column by name (mutates this DataFrame)
`factorize(String... names)`	`DataFrame`	Returns a new DataFrame with string columns encoded as integers
`dropna()`	`DataFrame`	Returns a new DataFrame without rows containing null values
`fillna(double value)`	`DataFrame`	Fills null/NaN/Inf in numeric columns with the given value
`setIndex(String column)`	`DataFrame`	Sets a column as the row index and removes it from columns

Source Location

Property	Value
File	`base/src/main/java/smile/data/DataFrame.java`
Lines	L490-731 (core column operations)
Package	`smile.data`
Repository	github.com/haifengl/smile

Import

import smile.data.DataFrame;
import smile.data.vector.ValueVector;
import smile.data.vector.DoubleVector;
import smile.data.vector.IntVector;
import smile.data.vector.StringVector;

Type: API Doc

Signature

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    // Column selection
    public DataFrame select(String... names)
    public DataFrame select(int... indices)
    public DataFrame apply(String... names)  // alias for select

    // Column removal
    public DataFrame drop(String... names)
    public DataFrame drop(int... indices)

    // Column addition/replacement
    public DataFrame add(ValueVector... vectors)
    public DataFrame set(String name, ValueVector column)
    public DataFrame update(String name, ValueVector column)  // alias for set

    // Horizontal merge (by columns)
    public DataFrame merge(DataFrame... dataframes)
    public DataFrame join(DataFrame other)

    // Vertical concatenation (by rows)
    public DataFrame concat(DataFrame... dataframes)

    // Categorical encoding
    public DataFrame factorize(String... names)

    // Missing value handling
    public DataFrame dropna()
    public DataFrame fillna(double value)

    // Index management
    public DataFrame setIndex(String column)
    public DataFrame setIndex(Object[] index)
}

Inputs and Outputs

Method	Input	Output	Mutates?
`select(names)`	Column name strings	New DataFrame with selected columns	No
`select(indices)`	Column index integers	New DataFrame with selected columns	No
`drop(names)`	Column name strings	New DataFrame without those columns	No
`drop(indices)`	Column index integers	New DataFrame without those columns	No
`merge(dfs)`	One or more DataFrames	New DataFrame with combined columns	No
`concat(dfs)`	One or more DataFrames (same schema)	New DataFrame with combined rows	No
`join(other)`	Another DataFrame	New DataFrame via inner join on index	No
`add(vectors)`	ValueVector column(s)	This DataFrame (modified)	Yes
`set(name, col)`	Column name + ValueVector	This DataFrame (modified)	Yes
`factorize(names)`	Column name strings (or empty for all string columns)	New DataFrame with encoded columns	No

Usage Examples

Example 1: Selecting columns by name

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Select only the feature columns (exclude species label)
DataFrame features = iris.select("sepal_length", "sepal_width",
    "petal_length", "petal_width");

System.out.println("Feature columns: " + features.ncol());  // 4
System.out.println(features.head(5));

Example 2: Dropping identifier columns

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/customers.csv",
    "delimiter=,,header=true");

// Remove non-predictive columns
DataFrame cleaned = data.drop("customer_id", "name", "email");
System.out.println("Remaining columns: " +
    String.join(", ", cleaned.names()));

Example 3: Factorizing categorical columns

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Convert "species" string column to integer encoding
DataFrame encoded = iris.factorize("species");

// The species column is now integer-valued with NominalScale
System.out.println(encoded.schema());
// species column: int with NominalScale(setosa, versicolor, virginica)

// Without arguments, factorize() converts ALL string columns
DataFrame allEncoded = iris.factorize();

Example 4: Merging DataFrames horizontally

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.vector.DoubleVector;

DataFrame data = Read.csv("data/features.csv",
    "delimiter=,,header=true");

// Compute a derived column and add it
double[] ratios = new double[data.nrow()];
for (int i = 0; i < data.nrow(); i++) {
    ratios[i] = data.getDouble(i, 0) / data.getDouble(i, 1);
}

// Create a single-column DataFrame and merge
DataFrame ratioFrame = new DataFrame(new DoubleVector("ratio", ratios));
DataFrame merged = data.merge(ratioFrame);

System.out.println("Merged columns: " +
    String.join(", ", merged.names()));

Example 5: Concatenating DataFrames vertically

import smile.io.Read;
import smile.data.DataFrame;

DataFrame train = Read.csv("data/train.csv",
    "delimiter=,,header=true");
DataFrame test = Read.csv("data/test.csv",
    "delimiter=,,header=true");

// Combine training and test sets (must have same schema)
DataFrame combined = train.concat(test);
System.out.println("Combined rows: " + combined.nrow());
// combined.nrow() == train.nrow() + test.nrow()

Example 6: Handling missing values

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/messy.csv",
    "delimiter=,,header=true");

// Option A: Remove rows with any null values
DataFrame complete = data.dropna();
System.out.println("Complete cases: " + complete.nrow());

// Option B: Fill nulls/NaN/Inf in numeric columns with a constant
data.fillna(0.0);

Example 7: Selecting columns by index

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/wide_table.csv",
    "delimiter=,,header=true");

// Select the first 5 columns by index
DataFrame subset = data.select(0, 1, 2, 3, 4);

// Drop column at index 0 (e.g., an ID column)
DataFrame noId = data.drop(0);

Implementation Details

select() and drop()

Both select() and drop() create new DataFrame instances by filtering the columns list. The row index is preserved. The select(String...) method uses schema.indexOf(name) to resolve column names to indices.
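
The name-to-index resolution and column filtering described above can be sketched in plain Java without the library. The class ColumnFilterSketch and its methods are hypothetical names used only for illustration; the schema is modeled as a list of names, and throwing on an unknown name is an assumption rather than confirmed Smile behavior.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; not part of Smile. */
class ColumnFilterSketch {
    /** Resolves column names to positions in the schema (assumed to throw on unknown names). */
    static int[] indicesOf(List<String> schema, String... names) {
        int[] indices = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            int j = schema.indexOf(names[i]);
            if (j < 0) {
                throw new IllegalArgumentException("Unknown column: " + names[i]);
            }
            indices[i] = j;
        }
        return indices;
    }

    /** Builds a new list holding only the columns at the given positions, in the order requested. */
    static <T> List<T> select(List<T> columns, int... indices) {
        List<T> selected = new ArrayList<>(indices.length);
        for (int i : indices) {
            selected.add(columns.get(i));
        }
        return selected;
    }

    /** Builds a new list holding every column except those at the given positions. */
    static <T> List<T> drop(List<T> columns, int... indices) {
        Set<Integer> dropped = new HashSet<>();
        for (int i : indices) {
            dropped.add(i);
        }
        List<T> kept = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            if (!dropped.contains(i)) {
                kept.add(columns.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("id", "sepal_length", "species");
        System.out.println(select(schema, indicesOf(schema, "species", "id"))); // [species, id]
        System.out.println(drop(schema, 0)); // [sepal_length, species]
    }
}
```

Neither helper modifies the input list, which mirrors the non-mutating behavior documented for select() and drop().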

merge()

The merge() method checks that all DataFrames have the same row count, then combines their column lists. Duplicate column names are resolved by appending a suffix (_2, _3, etc.) using a Set to track existing names.
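
As a rough illustration of the duplicate-name handling described above, the following sketch tracks used names in a Set and appends _2, _3, and so on until a free name is found. ColumnNameDeduper is a hypothetical helper written for this page; the exact suffixing rules inside Smile may differ.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; not part of Smile. */
class ColumnNameDeduper {
    /** Returns the names in order, suffixing any name that was already used. */
    static List<String> dedupe(List<String> names) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>(names.size());
        for (String name : names) {
            String candidate = name;
            int suffix = 2;
            // Set.add returns false when the name is taken, so keep bumping the suffix.
            while (!seen.add(candidate)) {
                candidate = name + "_" + suffix++;
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        // Column names of two or more frames laid side by side.
        System.out.println(dedupe(List.of("id", "x", "id", "y", "id")));
        // prints [id, x, id_2, y, id_3]
    }
}
```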

concat()

The concat() method verifies schema equality, streams all rows from all DataFrames, and materializes them into a new DataFrame via DataFrame.of(schema, rows). If all DataFrames have row indices, the indices are concatenated as well.
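
The schema check followed by row concatenation might look roughly like the sketch below. The Frame class is a hypothetical stand-in that models a schema as a list of column names and rows as double arrays; it ignores column types and row indices, which the real method also handles.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not part of Smile. */
class ConcatSketch {
    /** Minimal stand-in for a data frame: column names plus numeric rows. */
    static final class Frame {
        final List<String> schema;
        final List<double[]> rows;

        Frame(List<String> schema, List<double[]> rows) {
            this.schema = schema;
            this.rows = rows;
        }
    }

    /** Appends the rows of the other frames to a copy of the first frame's rows. */
    static Frame concat(Frame first, Frame... others) {
        List<double[]> rows = new ArrayList<>(first.rows);
        for (Frame other : others) {
            // Vertical concatenation only makes sense when the columns line up.
            if (!first.schema.equals(other.schema)) {
                throw new IllegalArgumentException(
                    "Schema mismatch: " + first.schema + " vs " + other.schema);
            }
            rows.addAll(other.rows);
        }
        return new Frame(first.schema, rows);
    }

    public static void main(String[] args) {
        Frame train = new Frame(List.of("x", "y"),
            List.of(new double[]{1.0, 2.0}, new double[]{3.0, 4.0}));
        Frame test = new Frame(List.of("x", "y"),
            List.of(new double[]{5.0, 6.0}));
        System.out.println(concat(train, test).rows.size()); // 3
    }
}
```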

factorize()

The factorize() method collects distinct sorted string values from each target column, creates a NominalScale with those levels, and produces a new IntVector where each string is replaced by its integer index. Missing/null strings map to -1.
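
The encoding step described above can be approximated with arrays and a lookup map. FactorizeSketch is a hypothetical class for illustration: it computes the sorted distinct levels and then maps each string to its level position, using -1 for null as the text states. It does not construct a NominalScale or an IntVector.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Illustrative only; not part of Smile. */
class FactorizeSketch {
    /** Distinct non-null values in natural (sorted) order. */
    static String[] levels(String[] values) {
        return Arrays.stream(values)
            .filter(Objects::nonNull)
            .distinct()
            .sorted()
            .toArray(String[]::new);
    }

    /** Replaces each string with the position of its level; null or unknown values become -1. */
    static int[] encode(String[] values, String[] levels) {
        Map<String, Integer> position = new HashMap<>();
        for (int i = 0; i < levels.length; i++) {
            position.put(levels[i], i);
        }
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            codes[i] = position.getOrDefault(values[i], -1);
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] species = {"virginica", "setosa", null, "versicolor", "setosa"};
        String[] levels = levels(species);
        System.out.println(Arrays.toString(levels));
        // prints [setosa, versicolor, virginica]
        System.out.println(Arrays.toString(encode(species, levels)));
        // prints [2, 0, -1, 1, 0]
    }
}
```

Because the levels are sorted before codes are assigned, the encoding does not depend on the order in which values appear in the column.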

join()

The join() method performs an inner join using the row index. If either DataFrame has no index, it falls back to merge(). The join iterates over the left DataFrame's index values, looks them up in the right DataFrame's index, and collects matching rows.
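
A minimal sketch of the index lookup described above, assuming each index is an array of unique keys: the right index is loaded into a map, and each left key that finds a match yields a pair of row positions. IndexJoinSketch is a hypothetical name, and the sketch returns row positions only instead of building a joined DataFrame.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; not part of Smile. */
class IndexJoinSketch {
    /**
     * Returns {leftRow, rightRow} pairs for keys present in both indices,
     * in the order of the left index. Keys are assumed unique per index.
     */
    static List<int[]> innerJoin(Object[] leftIndex, Object[] rightIndex) {
        Map<Object, Integer> rightPosition = new HashMap<>();
        for (int j = 0; j < rightIndex.length; j++) {
            rightPosition.put(rightIndex[j], j);
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < leftIndex.length; i++) {
            Integer j = rightPosition.get(leftIndex[i]);
            if (j != null) {
                // Inner join: keep the row only when both sides have the key.
                pairs.add(new int[]{i, j});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Object[] left = {"a", "b", "c"};
        Object[] right = {"c", "a", "d"};
        for (int[] pair : innerJoin(left, right)) {
            System.out.println(left[pair[0]] + ": left row " + pair[0] + ", right row " + pair[1]);
        }
        // prints:
        // a: left row 0, right row 1
        // c: left row 2, right row 0
    }
}
```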

Related Pages

Principle:Haifengl_Smile_Column_Selection_and_Filtering

Metadata

Property	Value
Type	API Doc
Language	Java
Library Version	5.2.0
Last Updated	2026-02-08 22:00 GMT
