Implementation:EvolvingLMMs Lab Lmms eval Create Iterator
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Data_Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool, provided by the lmms-eval framework, for partitioning evaluation data across processes using interleaved round-robin sharding.
Description
The create_iterator function in lmms_eval/utils.py implements round-robin data sharding using Python's itertools.islice. It takes a raw iterator over documents (typically from enumerate(task.eval_docs_no_media)) and returns a sliced iterator that yields only the documents assigned to the given rank.
The function is called from Task.build_all_requests() during the request construction phase. Each rank calls build_all_requests() with its own rank and world_size, receiving a different slice of the document iterator. This means each rank only constructs evaluation instances for its assigned documents, saving both memory and compute.
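The per-rank slicing described above can be illustrated without the framework. This is a minimal sketch that simulates four ranks in one process using itertools.islice directly; the document names and the world size are made up for illustration.

```python
from itertools import islice

docs = [f"doc{i}" for i in range(10)]
world_size = 4

# Each simulated rank slices its own fresh iterator with a stride of world_size.
shards = [
    list(islice(iter(docs), rank, None, world_size))
    for rank in range(world_size)
]

for rank, shard in enumerate(shards):
    print(rank, shard)

# Every document lands on exactly one rank: the shards are disjoint and complete.
assert sorted(d for shard in shards for d in shard) == sorted(docs)
```

Because every rank must build its own iterator over the full document sequence, the saving comes from skipping instance construction for unassigned documents, not from skipping the iteration itself.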
The implementation handles edge cases including:
- Null offset -- Treats a None offset as 0
- Negative offset -- Raises a ValueError for invalid negative offsets
- No limit -- When limit is None, the stop parameter of islice is also None, consuming the entire iterator
- Single GPU -- When world_size=1, the step is 1, returning all documents (no sharding)
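The edge cases listed above could be covered by a function along the following lines. This is a hypothetical re-implementation written from the description on this page, not a copy of lmms_eval/utils.py; the name create_iterator_sketch and the error message are assumptions.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def create_iterator_sketch(
    raw_iterator: Iterable,
    rank: int,
    world_size: int,
    limit: Optional[int] = None,
    offset: Optional[int] = 0,
) -> Iterator:
    """Hypothetical sketch of round-robin sharding with limit and offset."""
    # Null offset: treat None as 0.
    offset = 0 if offset is None else int(offset)
    # Negative offset: reject it explicitly.
    if offset < 0:
        raise ValueError(f"offset must be >= 0, got {offset}")
    start = rank + offset
    # No limit: stop=None lets islice run to the end of the iterator.
    stop = None if limit is None else offset + int(limit)
    # Single process: world_size=1 gives step 1, so nothing is skipped.
    return islice(raw_iterator, start, stop, world_size)


print(list(create_iterator_sketch(range(10), rank=1, world_size=4)))
print(list(create_iterator_sketch(range(10), rank=0, world_size=1, offset=None)))
```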
Usage
This function is used internally by Task.build_all_requests() during request construction. End users do not typically call it directly; when evaluation is launched with multiple processes via accelerate launch or torchrun, each rank's call shards the documents automatically.
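As a rough illustration of how a launched process could learn its shard coordinates: torchrun exports RANK and WORLD_SIZE environment variables to each worker. The helper name rank_and_world_size is hypothetical, and lmms-eval may obtain these values differently, so treat this as a sketch under that assumption and not as the framework's code.

```python
import os
from itertools import islice
from typing import Mapping, Tuple


def rank_and_world_size(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Hypothetical helper: read launcher-provided values, defaulting to one process."""
    return int(env.get("RANK", "0")), int(env.get("WORLD_SIZE", "1"))


rank, world_size = rank_and_world_size()
docs = list(range(10))

# Same slicing pattern as described on this page: start at rank, step by world_size.
my_docs = list(islice(iter(docs), rank, None, world_size))
print(f"rank {rank} of {world_size} evaluates {my_docs}")
```

Run as a plain script, the defaults give rank 0 of 1 and the full document list, which matches the single-process case.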
Code Reference
Source Location
- Repository: lmms-eval
- File: lmms_eval/utils.py
- Lines: L857-870
Called from:
- File: lmms_eval/api/task.py
- Lines: L382-442
Signature
def create_iterator(
    raw_iterator,
    rank: int,
    world_size: int,
    limit: Optional[int] = None,
    offset: int = 0,
) -> itertools.islice:
Import
from lmms_eval.utils import create_iterator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| raw_iterator | Iterator | Yes | The raw document iterator, typically enumerate(task.eval_docs_no_media) yielding (doc_id, doc) tuples |
| rank | int | Yes | The global rank of the current process (0 to world_size-1) |
| world_size | int | Yes | The total number of distributed processes |
| limit | Optional[int] | No (default: None) | Maximum total number of documents to evaluate across all ranks; None means no limit |
| offset | int | No (default: 0) | Number of documents to skip before sharding begins; must be >= 0 |
Outputs
| Name | Type | Description |
|---|---|---|
| sliced_iterator | itertools.islice | An iterator yielding only the elements assigned to the specified rank via round-robin selection: elements at positions rank+offset, rank+offset+world_size, rank+offset+2*world_size, ... |
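The position formula in the table can be checked with a plain range, which also gives the number of documents each rank receives. The function shard_positions and its total parameter are hypothetical names introduced here; the stop value offset+limit follows the worked example on this page.

```python
from typing import Optional


def shard_positions(
    rank: int,
    world_size: int,
    total: int,
    limit: Optional[int] = None,
    offset: int = 0,
) -> range:
    """Hypothetical helper: positions a rank would receive from `total` documents."""
    stop = total if limit is None else min(total, offset + limit)
    return range(rank + offset, stop, world_size)


# 10 documents over 4 ranks: ranks 0 and 1 get 3 documents, ranks 2 and 3 get 2.
for rank in range(4):
    positions = shard_positions(rank, world_size=4, total=10)
    print(rank, list(positions), len(positions))
```

When the document count is not a multiple of world_size, the lower ranks receive one more document than the higher ones.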
Usage Examples
Basic Example
from lmms_eval.utils import create_iterator
# Suppose we have 10 documents and 4 GPUs
documents = list(range(10)) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Rank 0 gets: [0, 4, 8]
rank0_docs = list(create_iterator(iter(documents), rank=0, world_size=4))
# rank0_docs == [0, 4, 8]
# Rank 1 gets: [1, 5, 9]
rank1_docs = list(create_iterator(iter(documents), rank=1, world_size=4))
# rank1_docs == [1, 5, 9]
# Rank 2 gets: [2, 6]
rank2_docs = list(create_iterator(iter(documents), rank=2, world_size=4))
# rank2_docs == [2, 6]
# Rank 3 gets: [3, 7]
rank3_docs = list(create_iterator(iter(documents), rank=3, world_size=4))
# rank3_docs == [3, 7]
With Limit and Offset
from lmms_eval.utils import create_iterator
documents = list(range(20))
# Evaluate only the first 8 documents, starting from offset 4
# Effective range: documents[4:12] = [4, 5, 6, 7, 8, 9, 10, 11]
# Rank 0 (start=0+4=4, stop=4+8=12, step=2): [4, 6, 8, 10]
rank0 = list(create_iterator(iter(documents), rank=0, world_size=2, limit=8, offset=4))
# Rank 1 (start=1+4=5, stop=4+8=12, step=2): [5, 7, 9, 11]
rank1 = list(create_iterator(iter(documents), rank=1, world_size=2, limit=8, offset=4))
Internal Usage in Task
# From lmms_eval/api/task.py build_all_requests():
doc_id_docs = utils.create_iterator(
    enumerate(self.eval_docs_no_media),
    rank=rank,
    limit=int(limit) if limit else None,
    world_size=world_size,
    offset=offset,
)
# Each rank iterates only over its assigned (doc_id, doc) pairs
for doc_id, doc in doc_id_docs:
    # Build evaluation instances for this document
    ...
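One consequence of wrapping the documents in enumerate before slicing is that each doc_id remains the document's position in the full dataset, not its position within the shard. A minimal sketch with made-up documents, using islice directly in place of the framework call:

```python
from itertools import islice

docs = ["a", "b", "c", "d", "e", "f"]
rank, world_size = 1, 3

# enumerate runs before slicing, so doc_id is the index in the full dataset.
shard = list(islice(enumerate(docs), rank, None, world_size))

for doc_id, doc in shard:
    print(doc_id, doc)
```

Here rank 1 sees the pairs (1, 'b') and (4, 'e'), so results gathered from different ranks can be matched back to the original documents by doc_id.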