Implementation:Haifengl Smile DataFrame Column Operations

From Leeroopedia


Overview

The DataFrame Column Operations API provides methods on the DataFrame record for selecting, dropping, merging, concatenating, joining, adding, and factorizing columns. Most of these methods return a new DataFrame and leave the receiver unchanged; add() and set() modify the receiver in place, as noted in the tables below. Because every method returns a DataFrame, calls can be chained into composable data preparation pipelines.

API Summary

| Method | Return Type | Description |
|---|---|---|
| select(String... names) | DataFrame | Returns a new DataFrame with only the named columns |
| select(int... indices) | DataFrame | Returns a new DataFrame with columns at the given indices |
| drop(String... names) | DataFrame | Returns a new DataFrame without the named columns |
| drop(int... indices) | DataFrame | Returns a new DataFrame without columns at the given indices |
| merge(DataFrame... dataframes) | DataFrame | Merges columns horizontally; renames duplicates with suffix |
| concat(DataFrame... dataframes) | DataFrame | Concatenates rows vertically; requires identical schemas |
| join(DataFrame other) | DataFrame | Inner join on row index; falls back to merge if no index |
| add(ValueVector... vectors) | DataFrame | Appends new columns (mutates this DataFrame) |
| set(String name, ValueVector column) | DataFrame | Sets/replaces a column by name (mutates this DataFrame) |
| factorize(String... names) | DataFrame | Returns a new DataFrame with string columns encoded as integers |
| dropna() | DataFrame | Returns a new DataFrame without rows containing null values |
| fillna(double value) | DataFrame | Fills null/NaN/Inf in numeric columns with the given value |
| setIndex(String column) | DataFrame | Sets a column as the row index and removes it from columns |

Source Location

| Property | Value |
|---|---|
| File | base/src/main/java/smile/data/DataFrame.java |
| Lines | L490-731 (core column operations) |
| Package | smile.data |
| Repository | github.com/haifengl/smile |

Import

import smile.data.DataFrame;
import smile.data.vector.ValueVector;
import smile.data.vector.DoubleVector;
import smile.data.vector.IntVector;
import smile.data.vector.StringVector;

Type: API Doc

Signature

public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {

    // Column selection
    public DataFrame select(String... names)
    public DataFrame select(int... indices)
    public DataFrame apply(String... names)  // alias for select

    // Column removal
    public DataFrame drop(String... names)
    public DataFrame drop(int... indices)

    // Column addition/replacement
    public DataFrame add(ValueVector... vectors)
    public DataFrame set(String name, ValueVector column)
    public DataFrame update(String name, ValueVector column)  // alias for set

    // Horizontal merge (by columns)
    public DataFrame merge(DataFrame... dataframes)
    public DataFrame join(DataFrame other)

    // Vertical concatenation (by rows)
    public DataFrame concat(DataFrame... dataframes)

    // Categorical encoding
    public DataFrame factorize(String... names)

    // Missing value handling
    public DataFrame dropna()
    public DataFrame fillna(double value)

    // Index management
    public DataFrame setIndex(String column)
    public DataFrame setIndex(Object[] index)
}

Inputs and Outputs

| Method | Input | Output | Mutates? |
|---|---|---|---|
| select(names) | Column name strings | New DataFrame with selected columns | No |
| select(indices) | Column index integers | New DataFrame with selected columns | No |
| drop(names) | Column name strings | New DataFrame without those columns | No |
| drop(indices) | Column index integers | New DataFrame without those columns | No |
| merge(dfs) | One or more DataFrames | New DataFrame with combined columns | No |
| concat(dfs) | One or more DataFrames (same schema) | New DataFrame with combined rows | No |
| join(other) | Another DataFrame | New DataFrame via inner join on index | No |
| add(vectors) | ValueVector column(s) | This DataFrame (modified) | Yes |
| set(name, col) | Column name + ValueVector | This DataFrame (modified) | Yes |
| factorize(names) | Column name strings (or empty for all string columns) | New DataFrame with encoded columns | No |

Usage Examples

Example 1: Selecting columns by name

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Select only the feature columns (exclude species label)
DataFrame features = iris.select("sepal_length", "sepal_width",
    "petal_length", "petal_width");

System.out.println("Feature columns: " + features.ncol());  // 4
System.out.println(features.head(5));

Example 2: Dropping identifier columns

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/customers.csv",
    "delimiter=,,header=true");

// Remove non-predictive columns
DataFrame cleaned = data.drop("customer_id", "name", "email");
System.out.println("Remaining columns: " +
    String.join(", ", cleaned.names()));

Example 3: Factorizing categorical columns

import smile.io.Read;
import smile.data.DataFrame;

DataFrame iris = Read.csv("data/iris.csv",
    "delimiter=,,header=true");

// Convert "species" string column to integer encoding
DataFrame encoded = iris.factorize("species");

// The species column is now integer-valued with NominalScale
System.out.println(encoded.schema());
// species column: int with NominalScale(setosa, versicolor, virginica)

// Without arguments, factorize() converts ALL string columns
DataFrame allEncoded = iris.factorize();

Example 4: Merging DataFrames horizontally

import smile.io.Read;
import smile.data.DataFrame;
import smile.data.vector.DoubleVector;

DataFrame data = Read.csv("data/features.csv",
    "delimiter=,,header=true");

// Compute a derived column and add it
double[] ratios = new double[data.nrow()];
for (int i = 0; i < data.nrow(); i++) {
    ratios[i] = data.getDouble(i, 0) / data.getDouble(i, 1);
}

// Create a single-column DataFrame and merge
DataFrame ratioFrame = new DataFrame(new DoubleVector("ratio", ratios));
DataFrame merged = data.merge(ratioFrame);

System.out.println("Merged columns: " +
    String.join(", ", merged.names()));

Example 5: Concatenating DataFrames vertically

import smile.io.Read;
import smile.data.DataFrame;

DataFrame train = Read.csv("data/train.csv",
    "delimiter=,,header=true");
DataFrame test = Read.csv("data/test.csv",
    "delimiter=,,header=true");

// Combine training and test sets (must have same schema)
DataFrame combined = train.concat(test);
System.out.println("Combined rows: " + combined.nrow());
// combined.nrow() == train.nrow() + test.nrow()

Example 6: Handling missing values

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/messy.csv",
    "delimiter=,,header=true");

// Option A: Remove rows with any null values
DataFrame complete = data.dropna();
System.out.println("Complete cases: " + complete.nrow());

// Option B: Fill nulls/NaN/Inf with a constant
data.fillna(0.0);

Example 7: Selecting columns by index

import smile.io.Read;
import smile.data.DataFrame;

DataFrame data = Read.csv("data/wide_table.csv",
    "delimiter=,,header=true");

// Select the first 5 columns by index
DataFrame subset = data.select(0, 1, 2, 3, 4);

// Drop column at index 0 (e.g., an ID column)
DataFrame noId = data.drop(0);

Implementation Details

select() and drop()

Both select() and drop() create new DataFrame instances by filtering the columns list. The row index is preserved. The select(String...) method uses schema.indexOf(name) to resolve column names to indices.
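
The name-resolution and column-filtering steps described above can be sketched in plain Java without the Smile library. This is an illustration only, not Smile's source: the class ColumnFilterSketch and its methods are hypothetical, a List<String> stands in for the schema, and throwing on an unknown name is an assumption about error handling.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a DataFrame's column list; not part of Smile.
final class ColumnFilterSketch {
    /** Resolves column names to positions, mimicking the role of schema.indexOf(name). */
    static int[] resolve(List<String> schemaNames, String... names) {
        int[] indices = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            int index = schemaNames.indexOf(names[i]);
            if (index < 0) {
                // Assumption: unknown names are rejected rather than silently ignored.
                throw new IllegalArgumentException("Unknown column: " + names[i]);
            }
            indices[i] = index;
        }
        return indices;
    }

    /** Builds a new list holding only the columns at the given positions, in the order requested. */
    static <T> List<T> select(List<T> columns, int... indices) {
        List<T> result = new ArrayList<>(indices.length);
        for (int index : indices) {
            result.add(columns.get(index));
        }
        return result;
    }

    /** Builds a new list holding every column except those at the given positions. */
    static <T> List<T> drop(List<T> columns, int... indices) {
        Set<Integer> dropped = new HashSet<>();
        for (int index : indices) {
            dropped.add(index);
        }
        List<T> result = new ArrayList<>();
        for (int i = 0; i < columns.size(); i++) {
            if (!dropped.contains(i)) {
                result.add(columns.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = List.of("id", "sepal_length", "species");
        int[] picked = resolve(names, "species", "sepal_length");
        System.out.println(select(names, picked)); // [species, sepal_length]
        System.out.println(drop(names, 0));        // [sepal_length, species]
    }
}
```

In both cases the input list is never modified, which mirrors the non-mutating behaviour listed for select() and drop() in the tables above.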

merge()

The merge() method checks that all DataFrames have the same row count, then combines their column lists. Duplicate column names are resolved by appending a suffix (_2, _3, etc.) using a Set to track existing names.
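
A minimal sketch of the two checks described above (equal row counts, then suffix-based renaming of duplicates) is shown here in plain Java. The class MergeNamesSketch and its methods are hypothetical and only model column names; the exact point at which Smile starts numbering suffixes is taken from the description (_2, _3, ...), not from the source code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of duplicate-name handling; not Smile's implementation.
final class MergeNamesSketch {
    /** Rejects inputs whose row counts differ, since columns can only be placed side by side when they align. */
    static void requireSameRowCount(int... rowCounts) {
        for (int count : rowCounts) {
            if (count != rowCounts[0]) {
                throw new IllegalArgumentException("Row counts differ: " + rowCounts[0] + " vs " + count);
            }
        }
    }

    /** Combines the name lists in order, appending _2, _3, ... to names that are already taken. */
    static List<String> mergeNames(List<List<String>> frames) {
        Set<String> seen = new HashSet<>();
        List<String> merged = new ArrayList<>();
        for (List<String> names : frames) {
            for (String name : names) {
                String candidate = name;
                int suffix = 2;
                // Set.add returns false when the name is already present.
                while (!seen.add(candidate)) {
                    candidate = name + "_" + suffix++;
                }
                merged.add(candidate);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        requireSameRowCount(150, 150, 150);
        List<String> merged = mergeNames(List.of(
                List.of("id", "x"), List.of("x", "y"), List.of("x")));
        System.out.println(merged); // [id, x, x_2, y, x_3]
    }
}
```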

concat()

The concat() method verifies schema equality, streams all rows from all DataFrames, and materializes them into a new DataFrame via DataFrame.of(schema, rows). If all DataFrames have row indices, the indices are concatenated as well.
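
The flow described above (schema check, row streaming, optional index concatenation) can be sketched with a toy frame type. Everything here is hypothetical: the record Frame, its fields, and the use of null for "no index" are assumptions made for illustration, and rows are stored as plain lists rather than Smile's Row objects.

```java
import java.util.List;
import java.util.stream.Stream;

// Hypothetical toy model of vertical concatenation; not Smile's implementation.
final class ConcatSketch {
    /** Toy frame: column names, row-major data, and an optional row index (null if absent). */
    record Frame(List<String> schema, List<List<Object>> rows, List<Object> index) {}

    static Frame concat(Frame first, Frame... others) {
        // Step 1: every input must share the first frame's schema.
        for (Frame other : others) {
            if (!first.schema().equals(other.schema())) {
                throw new IllegalArgumentException(
                        "Schema mismatch: " + first.schema() + " vs " + other.schema());
            }
        }
        List<Frame> all = Stream.concat(Stream.of(first), Stream.of(others)).toList();

        // Step 2: stream the rows of every frame, in order, into one list.
        List<List<Object>> rows = all.stream().flatMap(f -> f.rows().stream()).toList();

        // Step 3: keep an index only if every input has one.
        List<Object> index = null;
        if (all.stream().allMatch(f -> f.index() != null)) {
            index = all.stream().flatMap(f -> f.index().stream()).toList();
        }
        return new Frame(first.schema(), rows, index);
    }

    public static void main(String[] args) {
        Frame train = new Frame(List.of("x", "y"),
                List.of(List.<Object>of(1.0, "a"), List.<Object>of(2.0, "b")), null);
        Frame test = new Frame(List.of("x", "y"),
                List.of(List.<Object>of(3.0, "c")), null);
        System.out.println(concat(train, test).rows().size()); // 3
    }
}
```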

factorize()

The factorize() method collects distinct sorted string values from each target column, creates a NominalScale with those levels, and produces a new IntVector where each string is replaced by its integer index. Missing/null strings map to -1.
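
The encoding described above can be sketched on a plain String array. This is a hedged illustration, not Smile's code: the class FactorizeSketch and its methods are hypothetical, the level array stands in for a NominalScale, and the int array stands in for an IntVector. The -1 code for null follows the description in this section.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical illustration of string-to-integer encoding; not Smile's implementation.
final class FactorizeSketch {
    /** Collects the distinct non-null values in sorted order; these play the role of the scale's levels. */
    static String[] levels(String[] values) {
        return Arrays.stream(values)
                .filter(Objects::nonNull)
                .distinct()
                .sorted()
                .toArray(String[]::new);
    }

    /** Replaces each string by its position in the sorted level array; null maps to -1. */
    static int[] encode(String[] values, String[] levels) {
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // binarySearch is valid because levels is sorted in natural order.
            codes[i] = values[i] == null ? -1 : Arrays.binarySearch(levels, values[i]);
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] species = {"virginica", "setosa", null, "versicolor", "setosa"};
        String[] levels = levels(species);
        System.out.println(Arrays.toString(levels));                  // [setosa, versicolor, virginica]
        System.out.println(Arrays.toString(encode(species, levels))); // [2, 0, -1, 1, 0]
    }
}
```

Because the levels are sorted before codes are assigned, the same set of strings always yields the same encoding regardless of row order.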

join()

The join() method performs an inner join using the row index. If either DataFrame has no index, it falls back to merge(). The join iterates over the left DataFrame's index values, looks them up in the right DataFrame's index, and collects matching rows.
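
The lookup loop described above can be sketched as follows. The class IndexJoinSketch is hypothetical and works on bare index arrays rather than DataFrames; it assumes index labels are unique within the right-hand side, and it returns row-position pairs instead of materialized rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of an inner join on row index labels; not Smile's implementation.
final class IndexJoinSketch {
    /**
     * Returns {leftRow, rightRow} pairs for every left label that also appears on the right.
     * Assumption: labels on the right are unique, so each label maps to one row.
     */
    static List<int[]> innerJoin(Object[] leftIndex, Object[] rightIndex) {
        // Build a label -> row position lookup for the right side.
        Map<Object, Integer> rightLookup = new HashMap<>();
        for (int j = 0; j < rightIndex.length; j++) {
            rightLookup.put(rightIndex[j], j);
        }
        // Walk the left index in order and keep only labels found on the right.
        List<int[]> matches = new ArrayList<>();
        for (int i = 0; i < leftIndex.length; i++) {
            Integer j = rightLookup.get(leftIndex[i]);
            if (j != null) {
                matches.add(new int[] {i, j});
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Object[] left = {"a", "b", "c"};
        Object[] right = {"c", "a", "d"};
        for (int[] pair : innerJoin(left, right)) {
            System.out.println("left row " + pair[0] + " <-> right row " + pair[1]);
        }
    }
}
```

The result follows the left side's row order, and labels present on only one side ("b" and "d" in the usage above) are dropped, which is what makes the join an inner join.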

Metadata

| Property | Value |
|---|---|
| Type | API Doc |
| Language | Java |
| Library Version | 5.2.0 |
| Last Updated | 2026-02-08 22:00 GMT |
