Principle:Pola rs Polars Aggregation Expression Building
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, DataFrame |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Constructing aggregation expressions that reduce grouped data to summary statistics, supporting standard aggregates, conditional aggregation, filtered aggregation, and sorted aggregation.
Description
Aggregation is the process of collapsing multiple rows within each group into a single summary value (or a small set of summary values). In Polars, aggregation expressions are built using the expression API and passed to the .agg() method on a GroupBy or LazyGroupBy object. The result is a DataFrame with one row per group.
Polars supports several categories of aggregation:
- Standard aggregates -- Built-in reduction functions such as
sum(),mean(),min(),max(),first(),last(),std(),var(),median(),n_unique(),count(), andpl.len(). Each reduces a column of values to a single scalar per group. - Conditional aggregation -- Using
pl.when(predicate).then(value).otherwise(default)within aggregation expressions to compute aggregates conditionally. For example, summing a boolean expression(pl.col("party") == "Anti-Administration").sum()counts rows matching a condition. - Filtered aggregation -- Chaining
.filter(predicate)before an aggregate function to restrict which rows within each group participate in the aggregation. This is equivalent to SQL'sFILTER (WHERE ...)clause. - Sorted aggregation -- Chaining
.sort()or.sort_by()before.first()or.last()to pick the smallest/largest value within each group according to a custom ordering.
A distinctive feature of Polars is that multiple aggregation expressions can be evaluated simultaneously within a single .agg() call. Each expression operates independently on the group's data, and all results are combined into the output DataFrame.
Usage
Use this pattern whenever you need to:
- Compute summary statistics (count, sum, mean, etc.) per group.
- Count rows matching specific conditions within each group.
- Select the first or last value within each group after sorting by a secondary column.
- Combine multiple aggregation metrics in a single group-by operation.
- Apply custom helper functions that return aggregation expressions for reuse.
Theoretical Basis
In relational algebra, an aggregate function F maps a multiset (bag) of values to a single value:
F: Bag(T) -> T'
Standard aggregates:
SUM: Bag(Numeric) -> Numeric (additive reduction)
COUNT: Bag(Any) -> Integer (cardinality)
AVG: Bag(Numeric) -> Float (SUM / COUNT)
MIN: Bag(Ordered) -> Ordered (least element)
MAX: Bag(Ordered) -> Ordered (greatest element)
FIRST: Bag(Any) -> Any (positionally first element)
Conditional aggregation can be expressed as a composition of filtering and reduction:
CONDITIONAL_SUM(bag, predicate) = SUM(FILTER(bag, predicate))
This is equivalent to SQL's SUM(CASE WHEN predicate THEN value ELSE 0 END) pattern.
Filtered aggregation in Polars uses the .filter() method on expressions, which restricts the input bag before applying the aggregate:
expr.filter(predicate).mean() <=> AVG(FILTER(group_values, predicate))
The MapReduce analogy is direct: the group-by operation is the map phase (partition rows by key), and each aggregation function is a reduce operation applied independently to each partition. Polars executes these reductions in parallel across groups when using the lazy engine.
| Pattern | SQL Equivalent | Polars Expression |
|---|---|---|
| Standard aggregate | SUM(col) |
pl.col("col").sum()
|
| Conditional count | SUM(CASE WHEN cond THEN 1 ELSE 0 END) |
(pl.col("col") == value).sum()
|
| Filtered mean | AVG(col) FILTER (WHERE cond) |
pl.col("col").filter(predicate).mean()
|
| Sorted first | FIRST_VALUE(col) OVER (ORDER BY sort_col) |
pl.col("col").sort_by("sort_col").first()
|