Implementation:Teamcapybara Capybara RegexpDisassembler
| Knowledge Sources | |
|---|---|
| Domains | Testing, Selector_System |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Capybara::Selector::RegexpDisassembler is an internal utility class that extracts fixed literal substrings from Ruby regular expressions for efficient CSS/XPath pre-filtering.
Description
RegexpDisassembler parses a Ruby Regexp using the regexp_parser gem and extracts substrings that must appear in any string matching the regexp. These substrings are used to generate CSS substring-match attribute selectors or XPath contains() predicates that pre-filter the DOM before the full regexp is evaluated in Ruby, so fewer elements need post-query regexp matching.
initialize(regexp) stores the regexp. Two public methods provide different extraction strategies:
- substrings returns a flat Array of strings that must ALL appear in any match (AND semantics). It calls process(alternation: false), takes the first result set, and removes covered duplicates via remove_and_covered (e.g., if both "ab" and "abcd" are required, only "abcd" is kept).
- alternated_substrings returns an Array of Arrays representing OR-of-AND groups. It calls process(alternation: true) and removes covered alternation sets via remove_or_covered. It returns an empty array if any alternation branch yields no substrings, because that branch could then match without containing any known literal.
Both methods upcase all extracted strings when the regexp has the casefold? (case-insensitive) flag.
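The covered-duplicate removal described for substrings can be illustrated with a minimal sketch. The helper name drop_covered is hypothetical and this is not Capybara's private remove_and_covered; it only assumes that "covered" means "contained in another required string".

```ruby
# Hypothetical helper (not part of Capybara): drops every string that is
# already contained in another, longer required string.
def drop_covered(strings)
  unique = strings.uniq
  unique.reject do |candidate|
    # candidate is redundant if some other required string includes it
    unique.any? { |other| other != candidate && other.include?(candidate) }
  end
end

drop_covered(%w[ab abcd xy])
# => ["abcd", "xy"]
```

Requiring "abcd" already implies "ab", so keeping only the longer string yields a shorter but equivalent pre-filter.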
The private processing pipeline works in three stages: extract_strings walks the parsed regexp AST via the nested Expression class, combine handles alternation branching using Set-based products, and collapse joins adjacent literal fragments and removes empty segments.
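The collapse stage can be sketched roughly as follows, assuming the extraction stage yields a flat list of literal fragments with nil marking a break where the regexp could match arbitrary text. The method name collapse_fragments and the input shape are assumptions made for illustration, not the actual private API.

```ruby
# Hypothetical sketch of the collapse idea: join adjacent literal fragments,
# treat nil as a break between runs, then drop empty and duplicate results.
def collapse_fragments(fragments)
  runs = [[]]
  fragments.each do |fragment|
    if fragment.nil?
      runs << []             # a nil break ends the current literal run
    else
      runs.last << fragment  # adjacent literals belong to the same run
    end
  end
  runs.map(&:join).reject(&:empty?).uniq
end

collapse_fragments(['foo', nil, 'b', 'ar'])
# => ["foo", "bar"]
```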
The nested Expression class wraps Regexp::Parser expression nodes and classifies them: terminal? nodes yield literal text or escape characters, optional? nodes (zero-minimum repeats) produce option sets, alternation? nodes generate Set objects for branching, and indeterminate? nodes (meta/set types) produce nil breaks. Negative lookahead and lookbehind assertions are explicitly ignored.
Usage
Used internally by the selector system when a Regexp locator is passed to selectors that support it, enabling XPath/CSS pre-filtering optimization.
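As an illustration of how OR-of-AND substring groups could translate into an XPath pre-filter, here is a minimal sketch. The helper xpath_prefilter is hypothetical, it is not Capybara's builder code, and it assumes the substrings contain no single quotes.

```ruby
# Hypothetical helper (not Capybara's implementation): turns OR-of-AND
# substring groups into an XPath boolean expression over one attribute.
# Assumes substrings contain no single quotes, so no escaping is done.
def xpath_prefilter(attribute, groups)
  return nil if groups.empty? # nothing usable: no pre-filter can be built

  groups.map do |group|
    conjunction = group.map { |s| "contains(@#{attribute}, '#{s}')" }.join(' and ')
    "(#{conjunction})"
  end.join(' or ')
end

xpath_prefilter('id', [%w[foo bar]])
# => "(contains(@id, 'foo') and contains(@id, 'bar'))"
```

Elements that pass such a predicate are only candidates; the full regexp still has to be checked against each of them in Ruby.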
Code Reference
Source Location
- Repository: capybara
- File: lib/capybara/selector/regexp_disassembler.rb (211 lines)
Signature
module Capybara
  class Selector
    class RegexpDisassembler
      def initialize(regexp)
        # @param regexp [Regexp] The regular expression to disassemble
      end

      def alternated_substrings
        # @return [Array<Array<String>>] OR-of-AND substring groups
        # Each inner array is a set of strings that must ALL appear (AND)
        # Any one inner array matching is sufficient (OR)
        # Returns [] if any alternation branch yields no substrings
      end

      def substrings
        # @return [Array<String>] Strings that must ALL appear in any match
        # Covered substrings are removed (e.g., "ab" removed if "abcd" present)
      end
    end
  end
end
Import
require 'regexp_parser'
require 'capybara/selector/regexp_disassembler'
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| regexp | Regexp | Yes | Ruby regular expression to extract literal substrings from |
Outputs
| Name | Type | Description |
|---|---|---|
| substrings | Array<String> | Fixed literal substrings that must ALL appear in any match (AND semantics) |
| alternated_substrings | Array<Array<String>> | OR-of-AND substring groups for alternation-aware pre-filtering |
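The OR-of-AND reading of alternated_substrings can be made concrete with a small pure-Ruby check. The method may_match? is a hypothetical name used only for this sketch; it treats an empty result as "no pre-filter possible", which is an assumption about how a caller would interpret it.

```ruby
# Hypothetical sketch: evaluates OR-of-AND substring groups against a string.
# A true result means the string is only a candidate; the full regexp must
# still be applied to confirm the match.
def may_match?(text, groups)
  return true if groups.empty? # no known literals, so nothing can be ruled out

  groups.any? { |group| group.all? { |s| text.include?(s) } }
end

may_match?('say hello', [['hello'], ['world']]) # => true
may_match?('goodbye', [['hello'], ['world']])   # => false
may_match?('foobar', [%w[foo bar]])             # => true, although /foo\d+bar/ does not match
```

The last call shows why the substrings are a necessary condition rather than a sufficient one.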
Internal Details
Private Processing Pipeline
def process(alternation:)
  # 1. Parse regexp via Regexp::Parser.parse(@regexp)
  # 2. Extract strings via Expression#extract_strings
  # 3. Combine alternation branches via combine (uses Set-based products)
  # 4. Collapse adjacent literals via collapse (join, reject empty, uniq)
  # 5. Upcase all strings if @regexp.casefold?
end
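The Set-based product in step 3 can be pictured with a rough sketch. It assumes a simplified input in which each segment is either a literal String or a Set of alternative literals; the real intermediate structures may differ, and expand_branches is a hypothetical name.

```ruby
require 'set'

# Hypothetical sketch of combining alternation branches: every Set of
# alternatives multiplies the number of candidate fragment sequences.
def expand_branches(segments)
  segments.reduce([[]]) do |paths, segment|
    choices = segment.is_a?(Set) ? segment.to_a : [segment]
    # pair each existing path with each choice and extend the path
    paths.product(choices).map { |path, choice| path + [choice] }
  end
end

expand_branches(['a', Set['b', 'c'], 'd'])
# => [["a", "b", "d"], ["a", "c", "d"]]
```

Each resulting sequence corresponds to one way through the alternation and would then go through the collapse step independently.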
Nested Expression Class
The private Expression class wraps Regexp::Parser AST nodes and provides:
| Method | Description |
|---|---|
| extract_strings(process_alternatives) | Walks child expressions, collecting literal strings and nil breaks |
| alternation? | True for alternation nodes, whose branches are extracted as a Set |
| optional? | True when minimum repeat count is zero |
| terminal? | True for leaf nodes (literals, escapes) |
| strings(process_alternatives) | Dispatches to terminal_strings, optional_strings, or repeated_strings |
| alternative_strings | Returns Set of alternation branch extractions |
| ignore? | True for negative lookahead/lookbehind assertions |
Usage Examples
Extract AND Substrings
disassembler = Capybara::Selector::RegexpDisassembler.new(/foo\d+bar/)
disassembler.substrings
# => ["foo", "bar"]
# Both "foo" and "bar" must appear in any matching string
Extract OR-of-AND Substrings
disassembler = Capybara::Selector::RegexpDisassembler.new(/hello|world/)
disassembler.alternated_substrings
# => [["hello"], ["world"]]
# Either "hello" OR "world" must appear
Case-Insensitive Regexp
disassembler = Capybara::Selector::RegexpDisassembler.new(/FooBar/i)
disassembler.substrings
# => ["FOOBAR"]
# Upcased because casefold? is true
Related Pages
Implements Principle
- Principle:Teamcapybara_Capybara_Selector_Definition
- Principle:Teamcapybara_Capybara_Expression_Filtering