Implementation:Eventual Inc Daft Regexp Extract
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Text_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool, provided by the Daft library, for extracting regex matches from string expressions.
Description
The regexp_extract function extracts the specified match group from the first regex match in each string of a String expression. If index is 0 (the default), the entire match is returned. If the pattern does not match or the requested capture group does not exist, a null value is returned. The pattern can be a static string or a dynamic Expression for row-level pattern variation.
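The rules above can be sketched with a small pure-Python model. This is only an illustration of the described semantics, not Daft code: `extract_reference` is a hypothetical helper, it uses Python's `re` module (whose regex syntax may differ from the engine Daft uses), and returning `None` for a null input is an assumption rather than something stated here.

```python
import re
from typing import Optional


def extract_reference(text: Optional[str], pattern: str, index: int = 0) -> Optional[str]:
    """Hypothetical model of the described behavior (not part of Daft)."""
    if text is None:
        return None  # assumption: a null input yields a null output
    match = re.search(pattern, text)  # only the first match is considered
    if match is None:
        return None  # the pattern did not match
    if index > match.re.groups:
        return None  # the requested capture group does not exist
    return match.group(index)  # index 0 is the entire match


print(extract_reference("123-456", r"(\d)(\d*)"))     # 123
print(extract_reference("123-456", r"(\d)(\d*)", 1))  # 1
print(extract_reference("abc", r"(\d)(\d*)"))         # None
print(extract_reference("123-456", r"(\d)(\d*)", 5))  # None
```

The last two calls show the two null cases: no match at all, and a capture group index that the pattern does not define.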
Usage
Use this function as a standalone function or via the Expression method .str.extract() when you need to parse substrings from text columns using regular expressions.
Code Reference
Source Location
- Repository: Daft
- File: daft/functions/str.py
- Lines: L1072-1129
Signature
def regexp_extract(
expr: Expression,
pattern: str | Expression,
index: int = 0,
) -> Expression
Import
from daft.functions import regexp_extract
# or use as an Expression method
import daft
daft.col("text").str.extract(pattern, index)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| expr | Expression (String) | Yes | A String expression to extract matches from. |
| pattern | str \| Expression | Yes | The regular expression pattern to match. Can be a static string or a dynamic Expression. |
| index | int | No | The index of the capture group to extract. 0 returns the entire match; 1 returns the first capture group, etc. Defaults to 0. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Expression (String) | A String expression containing the extracted match for each row, or null if no match or the group does not exist. |
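To make the contract concrete when the pattern varies per row, here is a minimal pure-Python sketch. It models columns as plain lists and is not how Daft evaluates expressions; `extract_column` is a hypothetical name, Python's `re` module stands in for the real regex engine, and treating a null text or null pattern as a null result is an assumption.

```python
import re
from typing import List, Optional


def extract_column(
    texts: List[Optional[str]],
    patterns: List[Optional[str]],
    index: int = 0,
) -> List[Optional[str]]:
    """Hypothetical row-wise model: one output value (string or None) per input row."""
    out: List[Optional[str]] = []
    for text, pattern in zip(texts, patterns):
        if text is None or pattern is None:
            out.append(None)  # assumption: nulls propagate to the output
            continue
        match = re.search(pattern, text)  # first match in this row only
        if match is None or index > match.re.groups:
            out.append(None)  # no match, or the group does not exist
        else:
            out.append(match.group(index))
    return out


texts = ["id=42", "user:alice", "no match here"]
patterns = [r"id=(\d+)", r"user:(\w+)", r"(\d+)"]  # a different pattern for each row
result = extract_column(texts, patterns, index=1)
print(result)  # ['42', 'alice', None]
```

The output always has the same number of rows as the input, which mirrors the single String expression returned in the Outputs table.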
Usage Examples
Basic Usage
import daft
from daft.functions import regexp_extract
regex = r"(\d)(\d*)"
df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
df = df.with_column("match", regexp_extract(df["x"], regex))
df.collect()
# Returns: "123", "789", "345" (entire first match)
Extract Specific Capture Group
import daft
from daft.functions import regexp_extract
regex = r"(\d)(\d*)"
df = daft.from_pydict({"x": ["123-456", "789-012", "345-678"]})
# Extract first capture group (single digit)
df = df.with_column("first_digit", regexp_extract(df["x"], regex, 1))
df.collect()
# Returns: "1", "7", "3"
Using Expression Method
import daft
df = daft.from_pydict({"text": ["email: user@example.com", "contact: admin@test.org"]})
df = df.with_column("email", daft.col("text").str.extract(r"[\w.]+@[\w.]+"))
df.collect()
# Returns: "user@example.com", "admin@test.org"