Principle: pola-rs Polars Group By Key Definition

Knowledge Sources	Polars Polars Docs
Domains	Data Engineering, DataFrame
Last Updated	2026-02-09 10:00 GMT

Overview

Defining grouping keys that partition a DataFrame into subsets for independent aggregation, supporting both column names and computed expressions as keys.

Description

Group-by is the operation that partitions the rows of a DataFrame into disjoint subsets (groups) where all rows within a group share the same values for the specified key columns. Each group is then independently processed by aggregation expressions. In Polars, grouping keys can be:

Simple column names -- A string naming an existing column. All rows with the same value in that column belong to the same group. Example: group_by("state").
Computed expressions -- An arbitrary Polars expression that derives the grouping key from one or more existing columns. The expression should be aliased to give the resulting group column a descriptive name; without an alias, the key column keeps the name of the expression's root column. Example: group_by((pl.col("birthday").dt.year() // 10 * 10).alias("decade")).
Multi-column grouping -- Multiple keys can be specified; one group is formed for each distinct combination of key values that actually occurs in the data, not for the full Cartesian product of the per-column values. Example: group_by("state", "party").

The maintain_order parameter controls whether the output DataFrame preserves the order in which groups first appear in the input. By default this is False because Polars uses hash-based grouping, which does not guarantee order. Setting maintain_order=True adds an ordering step that preserves input order at a performance cost.

Usage

Use this pattern whenever you need to:

Partition a DataFrame by one or more categorical columns for aggregation.
Create derived grouping keys using expressions (e.g., decade from year, first letter of name).
Preserve the input order of groups in the output by enabling maintain_order.
Group a LazyFrame for deferred aggregation in a lazy query pipeline.

Theoretical Basis

In relational algebra, the GROUP BY operation partitions a relation R into equivalence classes based on a set of grouping attributes G = {g_1, g_2, ..., g_k}:

GROUP_BY(R, G) = { { r in R : r[g_1] = v_1 AND r[g_2] = v_2 AND ... AND r[g_k] = v_k }
                   : (v_1, v_2, ..., v_k) in pi_G(R) }

where pi_G(R) is the projection of R onto the grouping attributes.
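
The definition can be transcribed almost literally into plain Python. This is a deliberately naive sketch that mirrors the formula (one filter pass over R per distinct key tuple), not how a DataFrame engine implements grouping; the helper names project and group_by and the sample relation are hypothetical.

```python
def project(relation, attrs):
    """pi_G(R): the distinct key tuples, in order of first appearance."""
    seen = {}
    for r in relation:
        seen.setdefault(tuple(r[g] for g in attrs), None)
    return list(seen)


def group_by(relation, attrs):
    """Map each key tuple v in pi_G(R) to the rows of R that match v."""
    return {
        v: [r for r in relation if tuple(r[g] for g in attrs) == v]
        for v in project(relation, attrs)
    }


# Hypothetical relation R with grouping attributes G = {state, party}.
R = [
    {"state": "NY", "party": "X", "votes": 1},
    {"state": "CA", "party": "X", "votes": 2},
    {"state": "NY", "party": "X", "votes": 3},
    {"state": "NY", "party": "Y", "votes": 4},
]
groups = group_by(R, ["state", "party"])
```

Every row lands in exactly one group, so the group sizes sum to the size of R.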

Two primary implementation strategies exist:

Strategy	Mechanism	Order Preserved	Complexity
Hash-based grouping	Build a hash table on the grouping keys; assign each row to a bucket	No (hash order)	O(n) average, O(n^2) worst case
Sort-based grouping	Sort the data on grouping keys; consecutive equal keys form groups	Yes (sorted order)	O(n log n)
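
Both strategies can be sketched in a few lines of plain Python. This illustrates the general techniques only and is not Polars' internal implementation; the function names and sample rows are made up. One caveat: Python dicts preserve insertion order, so the hash-based sketch happens to return groups in first-seen order, which a general hash table does not guarantee.

```python
from itertools import groupby
from operator import itemgetter


def hash_group(rows, key):
    """Hash-based: one pass, each row is appended to its key's bucket."""
    buckets = {}
    for row in rows:
        buckets.setdefault(key(row), []).append(row)
    return buckets


def sort_group(rows, key):
    """Sort-based: sort on the key, then runs of equal keys form the groups."""
    return {k: list(g) for k, g in groupby(sorted(rows, key=key), key=key)}


# Hypothetical rows of (state, value).
rows = [("TX", 1), ("CA", 2), ("TX", 3), ("NY", 4)]
key = itemgetter(0)

hashed = hash_group(rows, key)
by_sort = sort_group(rows, key)
```

Both functions produce the same groups; only the order in which the groups are emitted differs.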

Polars uses hash-based grouping by default for its superior average-case performance. When maintain_order=True is specified, Polars tracks the first-seen order of each group and applies a post-grouping sort to restore that order. This adds overhead that grows with the number of groups.

Expression-based keys allow the grouping key to be any expression, not just a column reference. This is comparable to the GROUP BY expression syntax that many SQL dialects support (e.g., GROUP BY YEAR(birthday) / 10 * 10), but with the full power of Polars' expression system, including chained method calls, conditional logic, and string operations.

Related Pages

Implemented By

Implementation:Pola_rs_Polars_DataFrame_Group_By
