Principle:Pola rs Polars Group By Key Definition
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, DataFrame |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Defining grouping keys that partition a DataFrame into subsets for independent aggregation, supporting both column names and computed expressions as keys.
Description
Group-by is the operation that partitions the rows of a DataFrame into disjoint subsets (groups) where all rows within a group share the same values for the specified key columns. Each group is then independently processed by aggregation expressions. In Polars, grouping keys can be:
- Simple column names -- A string naming an existing column. All rows with the same value in that column belong to the same group. Example: group_by("state").
- Computed expressions -- An arbitrary Polars expression that derives the grouping key from one or more existing columns. The expression is normally aliased so that the resulting key column gets a descriptive name; without an alias, the key column generally keeps the name of the expression's root column (here, birthday). Example: group_by((pl.col("birthday").dt.year() // 10 * 10).alias("decade")).
- Multi-column grouping -- Multiple keys can be specified; each distinct combination of key values that actually occurs in the data forms one group. Example: group_by("state", "party").
The maintain_order parameter controls whether the output DataFrame preserves the order in which groups first appear in the input. By default this is False because Polars uses hash-based grouping, which does not guarantee order. Setting maintain_order=True adds an ordering step that preserves input order at a performance cost.
Usage
Use this pattern whenever you need to:
- Partition a DataFrame by one or more categorical columns for aggregation.
- Create derived grouping keys using expressions (e.g., decade from year, first letter of name).
- Preserve the input order of groups in the output by enabling maintain_order.
- Group a LazyFrame for deferred aggregation in a lazy query pipeline.
Theoretical Basis
In relational algebra, the GROUP BY operation partitions a relation R into equivalence classes based on a set of grouping attributes G = {g_1, g_2, ..., g_k}:
group(v_1, ..., v_k) = { r in R : r[g_1] = v_1 AND r[g_2] = v_2 AND ... AND r[g_k] = v_k }
GROUP_BY(R, G) = { group(v_1, ..., v_k) : (v_1, v_2, ..., v_k) in pi_G(R) }
where pi_G(R) is the projection of R onto the grouping attributes.
Two primary implementation strategies exist:
| Strategy | Mechanism | Output Order of Groups | Complexity |
|---|---|---|---|
| Hash-based grouping | Build a hash table on the grouping keys; assign each row to a bucket | Unspecified (hash order) | O(n) average, O(n^2) worst case |
| Sort-based grouping | Sort the data on grouping keys; consecutive equal keys form groups | Sorted by key (input order is not preserved) | O(n log n) |
Polars uses hash-based grouping by default for its superior average-case performance. When maintain_order=True is specified, Polars tracks the first-seen order of each group and applies a post-grouping sort to restore that order. This adds overhead that grows with the number of groups.
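The two strategies can be sketched in plain Python. This is a conceptual illustration only, not how Polars implements grouping internally. The rows and the helper names `hash_group` and `sort_group` are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows; G = ("state", "party") are the grouping attributes.
rows = [
    {"state": "TX", "party": "A"},
    {"state": "CA", "party": "B"},
    {"state": "TX", "party": "A"},
    {"state": "CA", "party": "A"},
]
key_of = itemgetter("state", "party")  # row -> (row["state"], row["party"])


def hash_group(rows):
    """Hash-based: one pass, each row is appended to its key's bucket."""
    groups = {}
    for row in rows:
        groups.setdefault(key_of(row), []).append(row)
    # Note: a Python dict happens to keep insertion order; a general
    # hash table gives no such guarantee.
    return groups


def sort_group(rows):
    """Sort-based: sort on the key, then split runs of equal keys."""
    ordered = sorted(rows, key=key_of)
    return {key: list(run) for key, run in groupby(ordered, key=key_of)}


hash_groups = hash_group(rows)
sort_groups = sort_group(rows)
```

Both functions produce the same partition of the rows. They differ only in cost and in the order in which the groups come out.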
Expression-based keys let the grouping key be any expression, not just a column reference. This is comparable to grouping on an expression in SQL (e.g., GROUP BY YEAR(birthday) / 10 * 10, where the available functions and the integer-division behavior depend on the SQL dialect), but with the full power of Polars' expression system, including chained method calls, conditional logic, and string operations.