Principle:Pola-rs Polars Group By Key Definition

From Leeroopedia


Knowledge Sources
Domains Data Engineering, DataFrame
Last Updated 2026-02-09 10:00 GMT

Overview

Defining grouping keys that partition a DataFrame into subsets for independent aggregation, supporting both column names and computed expressions as keys.

Description

Group-by is the operation that partitions the rows of a DataFrame into disjoint subsets (groups) where all rows within a group share the same values for the specified key columns. Each group is then independently processed by aggregation expressions. In Polars, grouping keys can be:

  1. Simple column names -- A string naming an existing column. All rows with the same value in that column belong to the same group. Example: group_by("state").
  2. Computed expressions -- An arbitrary Polars expression that derives the grouping key from one or more existing columns. The expression should normally be aliased to give the resulting group column a descriptive name; without an alias, the output column keeps the name of the expression's root column. Example: group_by((pl.col("birthday").dt.year() // 10 * 10).alias("decade")).
  3. Multi-column grouping -- Multiple keys can be specified; each group corresponds to one distinct combination of key values that actually occurs in the data, not to every element of the Cartesian product of the columns' distinct values. Example: group_by("state", "party").

The maintain_order parameter controls whether the output DataFrame preserves the order in which groups first appear in the input. By default this is False because Polars uses hash-based grouping, which does not guarantee order. Setting maintain_order=True adds an ordering step that preserves input order at a performance cost.

Usage

Use this pattern whenever you need to:

  • Partition a DataFrame by one or more categorical columns for aggregation.
  • Create derived grouping keys using expressions (e.g., decade from year, first letter of name).
  • Preserve the input order of groups in the output by enabling maintain_order.
  • Group a LazyFrame for deferred aggregation in a lazy query pipeline.

Theoretical Basis

In relational algebra, the GROUP BY operation partitions a relation R into equivalence classes based on a set of grouping attributes G = {g_1, g_2, ..., g_k}:

\[
\mathrm{GROUP\_BY}(R, G) = \bigl\{\, \{\, r \in R : r[g_1] = v_1 \land r[g_2] = v_2 \land \dots \land r[g_k] = v_k \,\} \;:\; (v_1, v_2, \dots, v_k) \in \pi_G(R) \,\bigr\}
\]

where pi_G(R) is the projection of R onto the grouping attributes.
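
The definition can be transcribed almost literally into plain Python. This is only an illustration of the set-theoretic definition, not how Polars implements grouping; the function name partition_by_keys and the sample relation are hypothetical.

```python
def partition_by_keys(relation, keys):
    """Partition a list of row dicts into groups, one per key combination."""

    def project(row):
        # r[g_1], ..., r[g_k] as a tuple
        return tuple(row[k] for k in keys)

    # pi_G(R): the distinct key combinations that occur in the relation.
    projection = {project(row) for row in relation}

    # One group per combination: all rows whose key values match it.
    return {
        combo: [row for row in relation if project(row) == combo]
        for combo in projection
    }


relation = [
    {"state": "NY", "party": "A", "votes": 1},
    {"state": "NY", "party": "B", "votes": 2},
    {"state": "CA", "party": "A", "votes": 3},
    {"state": "NY", "party": "A", "votes": 4},
]
groups = partition_by_keys(relation, ["state", "party"])
```

The groups are disjoint and together cover every row, which is what makes them equivalence classes. This direct transcription rescans the relation once per combination, so it is far slower than the strategies discussed next.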

Two primary implementation strategies exist:

| Strategy | Mechanism | Output Order | Complexity |
|---|---|---|---|
| Hash-based grouping | Build a hash table on the grouping keys; assign each row to a bucket | Unspecified (hash order) | O(n) average, O(n^2) worst case |
| Sort-based grouping | Sort the data on grouping keys; consecutive equal keys form groups | Sorted by key (not input order) | O(n log n) |
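
Both strategies can be sketched with the Python standard library. The helper names hash_group and sort_group are hypothetical, and the code illustrates the general techniques rather than Polars internals. One caveat: a Python dict remembers insertion order, so this hash-based sketch happens to return groups in first-seen order, which a general hash table does not guarantee.

```python
from itertools import groupby


def hash_group(rows, key):
    """Hash-based: one pass, each row appended to its key's bucket."""
    buckets = {}
    for row in rows:
        buckets.setdefault(key(row), []).append(row)
    return buckets


def sort_group(rows, key):
    """Sort-based: sort by key, then split runs of consecutive equal keys."""
    ordered = sorted(rows, key=key)  # stable sort, O(n log n)
    return {k: list(run) for k, run in groupby(ordered, key=key)}


rows = [("TX", 1), ("CA", 2), ("TX", 3), ("NY", 4), ("CA", 5)]


def state(row):
    return row[0]


hashed = hash_group(rows, state)
sorted_groups = sort_group(rows, state)
```

Both functions produce the same groups; they differ only in the order in which the groups are emitted and in their cost profile.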

Polars uses hash-based grouping by default for its superior average-case performance. When maintain_order=True is specified, Polars tracks the first-seen order of each group and applies a post-grouping sort to restore that order. This adds overhead proportional to the number of groups.

Expression-based keys allow the grouping key to be any expression, not just a column reference. This is comparable to grouping by an expression in SQL dialects that support it (e.g., GROUP BY YEAR(birthday) / 10 * 10, where the result of the division depends on the dialect's integer-division rules), but with the full power of Polars' expression system, including chained method calls, conditional logic, and string operations.
