Implementation:Iterative Dvc Collect Indexes
| Knowledge Sources | |
|---|---|
| Domains | Data_Synchronization, Version_Control |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for building per-revision filtered data index views that identify which files need to be transferred during push or fetch operations, provided by the DVC library.
Description
The _collect_indexes function is the internal workhorse that prepares data transfer manifests for DVC's push and fetch commands. It iterates over requested revisions using the repository's brancher mechanism, builds filtered IndexView objects at each revision, and returns a dictionary mapping revision identifiers to their respective views. It delegates the actual index filtering to index_from_targets, which constructs a read-only IndexView by collecting stages matching the given targets and applying stage-level and output-level filters.
_collect_indexes lives in dvc/repo/fetch.py but is also imported and used by dvc/repo/push.py. The companion index_from_targets function in dvc/repo/index.py handles the target-to-stage resolution and filter application. Together they form the collection layer that sits between user-facing commands and the low-level transfer machinery in dvc-data.
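The collection pattern described above (iterate over revisions, build a filtered view for each, return a mapping keyed by revision) can be sketched in plain Python. This is a toy model and not DVC's code: it imports nothing from DVC, and `collect_views`, `load_outputs`, and `FAKE_REPO` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterable, Optional

# Hypothetical stand-in for a tracked output; not a DVC class.
Output = dict


def collect_views(
    revisions: Iterable[str],
    load_outputs: Callable[[str], list],
    outs_filter: Optional[Callable[[Output], bool]] = None,
) -> dict:
    """Build one filtered view per revision, keyed by revision name."""
    views = {}
    for rev in revisions:
        outputs = load_outputs(rev)
        if outs_filter is not None:
            # Keep only the outputs accepted by the filter.
            outputs = [out for out in outputs if outs_filter(out)]
        views[rev] = outputs
    return views


# Toy data: two revisions with different tracked outputs (assumed values).
FAKE_REPO = {
    "workspace": [{"path": "data/train.csv", "size": 10}],
    "main": [
        {"path": "data/train.csv", "size": 10},
        {"path": "models/big.bin", "size": 5000},
    ],
}

views = collect_views(
    revisions=["workspace", "main"],
    load_outputs=lambda rev: FAKE_REPO[rev],
    outs_filter=lambda out: out["size"] <= 100,
)
for rev, outs in views.items():
    print(rev, [out["path"] for out in outs])
```

In the real function the per-revision view is an IndexView built by index_from_targets rather than a plain list, but the shape of the result (one entry per revision) is the same.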
Usage
_collect_indexes is a private helper (note the leading underscore), so its signature may change between DVC releases. Import it when building custom data synchronization workflows that need to enumerate transferable objects across multiple revisions; in standard usage, the fetch and push commands call it internally. Use index_from_targets directly when you need a filtered view of the data index for a single revision without the revision-iteration wrapper.
Code Reference
Source Location
- Repository: DVC
- File: dvc/repo/fetch.py
- Lines: L28-97 (_collect_indexes)
- Supporting file: dvc/repo/index.py
- Lines: L946-991 (index_from_targets)
Signature
def _collect_indexes(
    repo: "Repo",
    targets=None,
    remote=None,
    all_branches=False,
    with_deps=False,
    all_tags=False,
    recursive=False,
    all_commits=False,
    revs=None,
    workspace=True,
    max_size=None,
    types=None,
    config=None,
    onerror=None,
    push=False,
) -> dict[str, "IndexView"]:

def index_from_targets(
    repo: "Repo",
    targets: Optional["TargetType"] = None,
    stage_filter: Optional[Callable[["Stage"], bool]] = None,
    outs_filter: Optional[Callable[["Output"], bool]] = None,
    max_size: Optional[int] = None,
    types: Optional[list[str]] = None,
    with_deps: bool = False,
    recursive: bool = False,
    **kwargs: Any,
) -> "IndexView":
Import
from dvc.repo.fetch import _collect_indexes
from dvc.repo.index import index_from_targets
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo | Repo | Yes | The DVC repository instance providing access to config, brancher, and the data index. |
| targets | Optional[list] | No | Specific DVC files or output paths to include. If None, all tracked outputs are included. |
| remote | Optional[str] | No | Remote name to filter by; outputs assigned to a different remote are excluded. |
| all_branches | bool | No | If True, collect indexes for all branches. |
| all_tags | bool | No | If True, collect indexes for all tags. |
| all_commits | bool | No | If True, collect indexes for all commits. |
| revs | Optional[list] | No | Explicit list of revision identifiers to iterate over. |
| with_deps | bool | No | If True, include stages that are dependencies of targeted stages. |
| recursive | bool | No | If True, recursively match targets within directories. |
| workspace | bool | No | If True (default), include the current workspace in the revision set. |
| max_size | Optional[int] | No | Exclude outputs larger than this byte size. |
| types | Optional[list[str]] | No | Restrict to specific output types: "metrics", "plots", "params". |
| config | Optional[dict] | No | Additional config overrides to merge before collection. |
| onerror | Optional[Callable] | No | Error callback invoked with (rev, entry, exc) on collection failure. |
| push | bool | No | If True, excludes repo imports and non-pushable outputs. |
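To make the filtering parameters concrete, here is a minimal sketch of how `remote`, `push`, and `max_size` could combine into a single output predicate. It is an illustration under stated assumptions, not the library's implementation: `FakeOutput`, its field names, and `make_outs_filter` are hypothetical, and the sketch assumes that an output with no assigned remote is never excluded by the `remote` filter.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FakeOutput:
    """Hypothetical stand-in for a DVC output; field names are assumptions."""

    path: str
    size: int
    remote: Optional[str] = None  # remote assigned to this output, if any
    pushable: bool = True


def make_outs_filter(
    remote: Optional[str] = None,
    push: bool = False,
    max_size: Optional[int] = None,
) -> Callable[[FakeOutput], bool]:
    """Return a predicate that accepts outputs eligible for transfer."""

    def outs_filter(out: FakeOutput) -> bool:
        # When pushing, skip outputs that cannot be pushed.
        if push and not out.pushable:
            return False
        # Skip outputs assigned to a different remote than the requested one.
        if remote and out.remote and out.remote != remote:
            return False
        # Skip outputs larger than the size limit (in bytes).
        if max_size is not None and out.size > max_size:
            return False
        return True

    return outs_filter


outputs = [
    FakeOutput("data/train.csv", size=10),
    FakeOutput("data/other.csv", size=10, remote="backup"),
    FakeOutput("models/big.bin", size=5000),
    FakeOutput("imported/raw", size=10, pushable=False),
]

keep = make_outs_filter(remote="myremote", push=True, max_size=100)
kept_paths = [out.path for out in outputs if keep(out)]
print(kept_paths)
```

Building the predicate once and passing it down keeps the per-revision loop free of option handling, which matches the stage_filter and outs_filter callables that index_from_targets accepts.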
Outputs
| Name | Type | Description |
|---|---|---|
| return | dict[str, IndexView] | A dictionary mapping revision identifiers (e.g., "workspace", branch names, commit SHAs) to filtered IndexView objects. Each IndexView contains the filtered stages, outputs, data keys, and storage maps needed for transfer. |
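Because the result is keyed by revision, the same file can appear under several revisions. The sketch below shows one way a caller might merge per-revision key sets into a single de-duplicated list of paths. It uses plain sets of tuples rather than real IndexView objects; `keys_by_rev` and `unique_paths` are hypothetical names, and this is not a description of how DVC itself schedules transfers.

```python
# Hypothetical per-revision key sets; each key is a tuple of path parts.
keys_by_rev = {
    "workspace": {("data", "train.csv"), ("models", "model.pkl")},
    "main": {("data", "train.csv")},
    "v1.0": {("data", "train.csv"), ("data", "test.csv")},
}


def unique_paths(keys_by_rev):
    """Merge the key sets of all revisions and return sorted path strings."""
    merged = set()
    for keys in keys_by_rev.values():
        merged |= keys  # set union drops keys shared between revisions
    return sorted("/".join(key) for key in merged)


paths = unique_paths(keys_by_rev)
print(paths)
```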
Usage Examples
Basic Usage
from dvc.repo import Repo
from dvc.repo.fetch import _collect_indexes

repo = Repo()

# Collect indexes for the current workspace only
indexes = _collect_indexes(repo)
for rev, idx in indexes.items():
    print(f"Revision: {rev}")
    for out in idx.outs:
        print(f"  Output: {out}")

# Collect indexes across all branches for a push operation
indexes = _collect_indexes(
    repo,
    all_branches=True,
    push=True,
    remote="myremote",
)
print(f"Collected {len(indexes)} revision indexes")
Using index_from_targets Directly
from dvc.repo import Repo
from dvc.repo.index import index_from_targets

repo = Repo()

# Get a filtered view for specific targets
view = index_from_targets(
    repo,
    targets=["data/train.csv", "models/"],
    recursive=True,
)

# Inspect what would be transferred
for key_set in view.data_keys.values():
    for key in key_set:
        print(f"  Data key: {'/'.join(key)}")