Principle:Eventual Inc Daft Row Filtering
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Data_Transformation |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Technique for filtering DataFrame rows based on boolean predicate expressions.
Description
Row filtering applies a boolean predicate to each row and retains only the rows for which it evaluates to True; rows for which it evaluates to False or NULL are discarded. Predicates can combine comparisons and function calls with AND/OR/NOT logic, and can also be written as SQL expression strings. Row filtering is one of the most fundamental DataFrame operations and is critical for data cleaning, subsetting, and conditional analysis.
Usage
Use row filtering when only the rows that satisfy a condition should reach later steps. Common scenarios include removing invalid records, selecting data within a date range, filtering by category, applying business rules, and subsetting data for analysis.
Theoretical Basis
Row filtering implements the relational selection (sigma) operation:
Relational Algebra:
sigma_{predicate}(R)
SQL Equivalent:
SELECT * FROM R WHERE predicate
Pseudocode:
where(df, predicate):
    result = []
    for row in df:
        if evaluate(predicate, row) == True:
            result.append(row)
    return result
Predicate Composition:
- AND: (expr1) & (expr2)
- OR: (expr1) | (expr2)
- NOT: ~(expr)
- Comparison: expr1 > expr2, expr1 == expr2, etc.
Null Semantics:
- Comparisons involving NULL yield NULL (neither True nor False)
- Rows whose predicate evaluates to NULL are excluded from the result
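The null rules above can be modeled in plain Python, with `None` standing in for NULL. This is only an illustrative model of SQL-style three-valued logic, not Daft's implementation; the helper names `sql_gt`, `sql_not`, and `select` are hypothetical.

```python
from typing import Optional


def sql_gt(a, b) -> Optional[bool]:
    """Greater-than where any None operand makes the result None."""
    if a is None or b is None:
        return None
    return a > b


def sql_not(p: Optional[bool]) -> Optional[bool]:
    """Logical NOT that leaves None unchanged."""
    return None if p is None else (not p)


def select(rows, predicate):
    """Keep a row only when the predicate is exactly True."""
    return [row for row in rows if predicate(row) is True]


rows = [{"x": 1}, {"x": None}, {"x": 3}]

kept = select(rows, lambda r: sql_gt(r["x"], 1))
negated = select(rows, lambda r: sql_not(sql_gt(r["x"], 1)))
```

In this model the row with a null `x` appears in neither `kept` nor `negated`, so a filter and its negation do not necessarily partition the input.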
The query optimizer can push filter predicates closer to data sources (predicate pushdown), enabling partition pruning and reducing the amount of data read from storage.
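To illustrate why pushdown reduces the data read, here is a simplified pure-Python sketch of partition pruning using per-partition min/max statistics. It is a conceptual model only and does not reflect how any particular optimizer is implemented; `Partition` and `scan_with_pushdown` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    min_year: int  # smallest "year" value stored in this partition
    max_year: int  # largest "year" value stored in this partition
    rows: list


def scan_with_pushdown(partitions, lo):
    """Return rows with year >= lo, skipping partitions the statistics rule out."""
    partitions_read = []
    out = []
    for p in partitions:
        if p.max_year < lo:
            continue  # pruned: no row in this partition can satisfy the predicate
        partitions_read.append(p.name)
        out.extend(r for r in p.rows if r["year"] >= lo)
    return out, partitions_read


partitions = [
    Partition("p2021", 2021, 2021, [{"year": 2021, "v": 1}]),
    Partition("p2022_2023", 2022, 2023, [{"year": 2022, "v": 2}, {"year": 2023, "v": 3}]),
    Partition("p2024", 2024, 2024, [{"year": 2024, "v": 4}]),
]

rows, partitions_read = scan_with_pushdown(partitions, 2023)
```

Here the first partition is never read, while the second is read and then filtered row by row because its statistics cannot rule it out.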