Implementation:Haifengl Smile DataFrame Column Operations
Overview
The DataFrame Column Operations API provides methods on the DataFrame record for selecting, dropping, merging, concatenating, joining, adding, and factorizing columns. Most of these methods return a new DataFrame and leave the receiver unchanged; add() and set() modify the DataFrame in place and return it. Because every method returns a DataFrame, calls can be chained into composable data preparation pipelines.
API Summary
| Method | Return Type | Description |
|---|---|---|
| select(String... names) | DataFrame | Returns a new DataFrame with only the named columns |
| select(int... indices) | DataFrame | Returns a new DataFrame with columns at the given indices |
| drop(String... names) | DataFrame | Returns a new DataFrame without the named columns |
| drop(int... indices) | DataFrame | Returns a new DataFrame without columns at the given indices |
| merge(DataFrame... dataframes) | DataFrame | Merges columns horizontally; renames duplicates with suffix |
| concat(DataFrame... dataframes) | DataFrame | Concatenates rows vertically; requires identical schemas |
| join(DataFrame other) | DataFrame | Inner join on row index; falls back to merge if no index |
| add(ValueVector... vectors) | DataFrame | Appends new columns (mutates this DataFrame) |
| set(String name, ValueVector column) | DataFrame | Sets/replaces a column by name (mutates this DataFrame) |
| factorize(String... names) | DataFrame | Returns a new DataFrame with string columns encoded as integers |
| dropna() | DataFrame | Returns a new DataFrame without rows containing null values |
| fillna(double value) | DataFrame | Fills null/NaN/Inf in numeric columns with the given value |
| setIndex(String column) | DataFrame | Sets a column as the row index and removes it from columns |
Source Location
| Property | Value |
|---|---|
| File | base/src/main/java/smile/data/DataFrame.java |
| Lines | L490-731 (core column operations) |
| Package | smile.data |
| Repository | github.com/haifengl/smile |
Import
import smile.data.DataFrame;
import smile.data.vector.ValueVector;
import smile.data.vector.DoubleVector;
import smile.data.vector.IntVector;
import smile.data.vector.StringVector;
Type: API Doc
Signature
public record DataFrame(StructType schema, List<ValueVector> columns, RowIndex index)
        implements Iterable<Row>, Serializable {
    // Column selection
    public DataFrame select(String... names)
    public DataFrame select(int... indices)
    public DataFrame apply(String... names) // alias for select
    // Column removal
    public DataFrame drop(String... names)
    public DataFrame drop(int... indices)
    // Column addition/replacement
    public DataFrame add(ValueVector... vectors)
    public DataFrame set(String name, ValueVector column)
    public DataFrame update(String name, ValueVector column) // alias for set
    // Horizontal merge (by columns)
    public DataFrame merge(DataFrame... dataframes)
    public DataFrame join(DataFrame other)
    // Vertical concatenation (by rows)
    public DataFrame concat(DataFrame... dataframes)
    // Categorical encoding
    public DataFrame factorize(String... names)
    // Missing value handling
    public DataFrame dropna()
    public DataFrame fillna(double value)
    // Index management
    public DataFrame setIndex(String column)
    public DataFrame setIndex(Object[] index)
}
Inputs and Outputs
| Method | Input | Output | Mutates? |
|---|---|---|---|
| select(names) | Column name strings | New DataFrame with selected columns | No |
| select(indices) | Column index integers | New DataFrame with selected columns | No |
| drop(names) | Column name strings | New DataFrame without those columns | No |
| drop(indices) | Column index integers | New DataFrame without those columns | No |
| merge(dfs) | One or more DataFrames | New DataFrame with combined columns | No |
| concat(dfs) | One or more DataFrames (same schema) | New DataFrame with combined rows | No |
| join(other) | Another DataFrame | New DataFrame via inner join on index | No |
| add(vectors) | ValueVector column(s) | This DataFrame (modified) | Yes |
| set(name, col) | Column name + ValueVector | This DataFrame (modified) | Yes |
| factorize(names) | Column name strings (or empty for all string columns) | New DataFrame with encoded columns | No |
Usage Examples
Example 1: Selecting columns by name
import smile.io.Read;
import smile.data.DataFrame;
DataFrame iris = Read.csv("data/iris.csv",
        "delimiter=,,header=true");
// Select only the feature columns (exclude species label)
DataFrame features = iris.select("sepal_length", "sepal_width",
        "petal_length", "petal_width");
System.out.println("Feature columns: " + features.ncol()); // 4
System.out.println(features.head(5));
Example 2: Dropping identifier columns
import smile.io.Read;
import smile.data.DataFrame;
DataFrame data = Read.csv("data/customers.csv",
        "delimiter=,,header=true");
// Remove non-predictive columns
DataFrame cleaned = data.drop("customer_id", "name", "email");
System.out.println("Remaining columns: " +
        String.join(", ", cleaned.names()));
Example 3: Factorizing categorical columns
import smile.io.Read;
import smile.data.DataFrame;
DataFrame iris = Read.csv("data/iris.csv",
        "delimiter=,,header=true");
// Convert "species" string column to integer encoding
DataFrame encoded = iris.factorize("species");
// The species column is now integer-valued with NominalScale
System.out.println(encoded.schema());
// species column: int with NominalScale(setosa, versicolor, virginica)
// Without arguments, factorize() converts ALL string columns
DataFrame allEncoded = iris.factorize();
Example 4: Merging DataFrames horizontally
import smile.io.Read;
import smile.data.DataFrame;
import smile.data.vector.DoubleVector;
DataFrame data = Read.csv("data/features.csv",
        "delimiter=,,header=true");
// Compute a derived column (ratio of the first two columns)
double[] ratios = new double[data.nrow()];
for (int i = 0; i < data.nrow(); i++) {
    ratios[i] = data.getDouble(i, 0) / data.getDouble(i, 1);
}
// Create a single-column DataFrame and merge
DataFrame ratioFrame = new DataFrame(new DoubleVector("ratio", ratios));
DataFrame merged = data.merge(ratioFrame);
System.out.println("Merged columns: " +
        String.join(", ", merged.names()));
Example 5: Concatenating DataFrames vertically
import smile.io.Read;
import smile.data.DataFrame;
DataFrame train = Read.csv("data/train.csv",
        "delimiter=,,header=true");
DataFrame test = Read.csv("data/test.csv",
        "delimiter=,,header=true");
// Combine training and test sets (must have same schema)
DataFrame combined = train.concat(test);
System.out.println("Combined rows: " + combined.nrow());
// combined.nrow() == train.nrow() + test.nrow()
Example 6: Handling missing values
import smile.io.Read;
import smile.data.DataFrame;
DataFrame data = Read.csv("data/messy.csv",
        "delimiter=,,header=true");
// Option A: Remove rows with any null values
DataFrame complete = data.dropna();
System.out.println("Complete cases: " + complete.nrow());
// Option B: Fill nulls/NaN/Inf in numeric columns with a constant
data.fillna(0.0);
Example 7: Selecting columns by index
import smile.io.Read;
import smile.data.DataFrame;
DataFrame data = Read.csv("data/wide_table.csv",
        "delimiter=,,header=true");
// Select the first 5 columns by index
DataFrame subset = data.select(0, 1, 2, 3, 4);
// Drop column at index 0 (e.g., an ID column)
DataFrame noId = data.drop(0);
Implementation Details
select() and drop()
Both select() and drop() create new DataFrame instances by filtering the columns list. The row index is preserved. The select(String...) method uses schema.indexOf(name) to resolve column names to indices.
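The name-to-index resolution and column filtering described above can be sketched with plain Java collections. This is a simplified illustration, not Smile's source: the class ColumnSelectSketch and its methods are hypothetical, a list of names stands in for the schema, and rejecting unknown names with an exception is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the column bookkeeping behind select() and drop().
final class ColumnSelectSketch {
    /** Resolves column names to positions, mirroring a schema.indexOf(name) lookup. */
    static int[] indicesOf(List<String> schemaNames, String... names) {
        int[] indices = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            int j = schemaNames.indexOf(names[i]);
            if (j < 0) {
                // Assumption of this sketch: unknown names are an error.
                throw new IllegalArgumentException("Unknown column: " + names[i]);
            }
            indices[i] = j;
        }
        return indices;
    }

    /** Returns a new list with the columns at the given indices, in the requested order. */
    static <T> List<T> select(List<T> columns, int... indices) {
        List<T> out = new ArrayList<>(indices.length);
        for (int j : indices) {
            out.add(columns.get(j));
        }
        return out;
    }

    /** Returns a new list without the columns at the given indices; the input is untouched. */
    static <T> List<T> drop(List<T> columns, int... indices) {
        Set<Integer> dropped = new HashSet<>();
        for (int j : indices) {
            dropped.add(j);
        }
        List<T> out = new ArrayList<>();
        for (int j = 0; j < columns.size(); j++) {
            if (!dropped.contains(j)) {
                out.add(columns.get(j));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> names = List.of("id", "age", "income");
        int[] picked = indicesOf(names, "income", "id");
        System.out.println(select(names, picked)); // [income, id]
        System.out.println(drop(names, 0));        // [age, income]
    }
}
```

Both helpers build a fresh list rather than editing the input, which matches the non-mutating behavior listed for select() and drop() in the tables above.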
merge()
The merge() method checks that all DataFrames have the same row count, then combines their column lists. Duplicate column names are resolved by appending a suffix (_2, _3, etc.) using a Set to track existing names.
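The two steps of merge(), the row-count check and the duplicate-name renaming, can be sketched as follows. The class MergeNamesSketch is hypothetical and works on column names only; the exact suffix scheme in Smile may differ in detail from this illustration of the Set-based approach.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the bookkeeping behind a horizontal merge.
final class MergeNamesSketch {
    /** Rejects the merge unless every frame has the same number of rows. */
    static void requireSameRowCount(int... rowCounts) {
        for (int n : rowCounts) {
            if (n != rowCounts[0]) {
                throw new IllegalArgumentException(
                        "Row counts differ: " + rowCounts[0] + " vs " + n);
            }
        }
    }

    /** Combines column names frame by frame, renaming duplicates with _2, _3, ... */
    static List<String> mergeNames(List<List<String>> frames) {
        Set<String> seen = new HashSet<>();
        List<String> merged = new ArrayList<>();
        for (List<String> names : frames) {
            for (String name : names) {
                String candidate = name;
                int suffix = 2;
                // Set.add returns false when the name is already taken.
                while (!seen.add(candidate)) {
                    candidate = name + "_" + suffix++;
                }
                merged.add(candidate);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        requireSameRowCount(150, 150, 150);
        System.out.println(mergeNames(List.of(
                List.of("id", "x"), List.of("x", "y"), List.of("x"))));
        // [id, x, x_2, y, x_3]
    }
}
```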
concat()
The concat() method verifies schema equality, streams all rows from all DataFrames, and materializes them into a new DataFrame via DataFrame.of(schema, rows). If all DataFrames have row indices, the indices are concatenated as well.
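A minimal sketch of the schema check and row concatenation, assuming a toy Frame record (hypothetical, not a Smile type) that holds column names and numeric rows. It omits the row-index concatenation mentioned above and compares names only, whereas a real schema comparison would also cover column types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of vertical concatenation with a schema check.
final class ConcatSketch {
    /** Toy frame: column names plus one double[] per row. */
    record Frame(List<String> names, List<double[]> rows) {}

    static Frame concat(Frame first, Frame... others) {
        // Verify every schema before copying any rows.
        for (Frame other : others) {
            if (!first.names().equals(other.names())) {
                throw new IllegalArgumentException(
                        "Schema mismatch: " + first.names() + " vs " + other.names());
            }
        }
        // Copy rows into a new list so the inputs stay unchanged.
        List<double[]> rows = new ArrayList<>(first.rows());
        for (Frame other : others) {
            rows.addAll(other.rows());
        }
        return new Frame(first.names(), rows);
    }

    public static void main(String[] args) {
        Frame train = new Frame(List.of("x", "y"),
                List.of(new double[] {1, 2}, new double[] {3, 4}));
        Frame test = new Frame(List.of("x", "y"),
                List.of(new double[] {5, 6}));
        System.out.println(concat(train, test).rows().size()); // 3
    }
}
```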
factorize()
The factorize() method collects distinct sorted string values from each target column, creates a NominalScale with those levels, and produces a new IntVector where each string is replaced by its integer index. Missing/null strings map to -1.
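The encoding step can be illustrated on a plain string array. The class FactorizeSketch below is hypothetical; it follows the description above (distinct sorted levels, integer codes, -1 for null) but does not construct Smile's NominalScale or IntVector.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of string-to-integer categorical encoding.
final class FactorizeSketch {
    /** Collects the distinct non-null values in sorted order; these act as the levels. */
    static String[] levels(String[] values) {
        return Arrays.stream(values)
                .filter(Objects::nonNull)
                .distinct()
                .sorted()
                .toArray(String[]::new);
    }

    /** Replaces each string by the position of its level; null becomes -1. */
    static int[] encode(String[] values, String[] levels) {
        Map<String, Integer> codes = new HashMap<>();
        for (int i = 0; i < levels.length; i++) {
            codes.put(levels[i], i);
        }
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] == null ? -1 : codes.get(values[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] species = {"virginica", "setosa", null, "versicolor", "setosa"};
        String[] levels = levels(species);
        System.out.println(Arrays.toString(levels));
        // [setosa, versicolor, virginica]
        System.out.println(Arrays.toString(encode(species, levels)));
        // [2, 0, -1, 1, 0]
    }
}
```

Sorting the levels makes the codes deterministic: the same set of strings always yields the same integers regardless of row order.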
join()
The join() method performs an inner join using the row index. If either DataFrame has no index, it falls back to merge(). The join iterates over the left DataFrame's index values, looks them up in the right DataFrame's index, and collects matching rows.
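The index lookup at the heart of the join can be sketched as below. The class IndexJoinSketch is hypothetical and covers only the matching step: it assumes index values are unique on the right side and leaves out both the fallback to merge() and the assembly of the joined columns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an inner join on row index values.
final class IndexJoinSketch {
    /**
     * Returns {leftRow, rightRow} pairs for index values present on both sides,
     * in left-frame order. Assumes right index values are unique.
     */
    static List<int[]> innerJoin(Object[] leftIndex, Object[] rightIndex) {
        // Map each right index value to its row position for O(1) lookup.
        Map<Object, Integer> rightPos = new HashMap<>();
        for (int j = 0; j < rightIndex.length; j++) {
            rightPos.put(rightIndex[j], j);
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < leftIndex.length; i++) {
            Integer j = rightPos.get(leftIndex[i]);
            if (j != null) {
                pairs.add(new int[] {i, j}); // keep only rows that match
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Object[] left = {"a", "b", "c"};
        Object[] right = {"c", "a", "d"};
        for (int[] p : innerJoin(left, right)) {
            System.out.println(left[p[0]] + ": left row " + p[0] + ", right row " + p[1]);
        }
        // a: left row 0, right row 1
        // c: left row 2, right row 0
    }
}
```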
Metadata
| Property | Value |
|---|---|
| Type | API Doc |
| Language | Java |
| Library Version | 5.2.0 |
| Last Updated | 2026-02-08 22:00 GMT |