Implementation:Iterative Dvc Collect Indexes

From Leeroopedia


Knowledge Sources
Domains Data_Synchronization, Version_Control
Last Updated 2026-02-10 00:00 GMT

Overview

A concrete tool, provided by the DVC library, for building per-revision filtered data index views that identify which files must be transferred during push or fetch operations.

Description

The _collect_indexes function is the internal workhorse that prepares data transfer manifests for DVC's push and fetch commands. It iterates over requested revisions using the repository's brancher mechanism, builds filtered IndexView objects at each revision, and returns a dictionary mapping revision identifiers to their respective views. It delegates the actual index filtering to index_from_targets, which constructs a read-only IndexView by collecting stages matching the given targets and applying stage-level and output-level filters.
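The collection loop described above can be pictured with a small, self-contained sketch. It does not import DVC: FakeOutput, fake_brancher, and collect_views are hypothetical stand-ins invented for illustration, and the sketch only assumes the behaviour stated in this section (iterate over revisions, filter at each one, key the result by revision).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class FakeOutput:
    """Hypothetical stand-in for a DVC output; not the real Output class."""
    path: str
    can_push: bool = True


def fake_brancher(revs: Iterable[str], workspace: bool = True) -> Iterator[str]:
    """Yield revision names, workspace first (stand-in for the brancher)."""
    if workspace:
        yield "workspace"
    yield from revs


def collect_views(
    outputs_by_rev: Dict[str, List[FakeOutput]],
    revs: Iterable[str] = (),
    workspace: bool = True,
    outs_filter: Optional[Callable[[FakeOutput], bool]] = None,
) -> Dict[str, List[FakeOutput]]:
    """Build one filtered view per revision and key it by revision name."""
    views: Dict[str, List[FakeOutput]] = {}
    for rev in fake_brancher(revs, workspace=workspace):
        outs = outputs_by_rev.get(rev, [])
        views[rev] = [o for o in outs if outs_filter is None or outs_filter(o)]
    return views


# Toy data: two revisions, one output that must not be pushed.
data = {
    "workspace": [
        FakeOutput("data/train.csv"),
        FakeOutput("imported.bin", can_push=False),
    ],
    "main": [FakeOutput("data/train.csv")],
}
views = collect_views(data, revs=["main"], outs_filter=lambda o: o.can_push)
for rev, outs in views.items():
    print(rev, [o.path for o in outs])
```

The real function returns IndexView objects rather than plain lists; the list is used here only to keep the sketch independent of DVC internals.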

_collect_indexes lives in dvc/repo/fetch.py but is also imported and used by dvc/repo/push.py. The companion index_from_targets function in dvc/repo/index.py handles the target-to-stage resolution and filter application. Together they form the collection layer that sits between user-facing commands and the low-level transfer machinery in dvc-data.

Usage

Import _collect_indexes when building custom data synchronization workflows that need to enumerate transferable objects across multiple revisions. In standard usage, it is called internally by the fetch and push commands; the leading underscore marks it as a private helper, so treat its signature as subject to change between DVC releases. Use index_from_targets directly when you need a filtered view of the data index for a single revision without the revision-iteration wrapper.

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/repo/fetch.py
  • Lines: L28-97 (_collect_indexes)
  • Supporting file: dvc/repo/index.py
  • Lines: L946-991 (index_from_targets)

Signature

def _collect_indexes(
    repo: "Repo",
    targets=None,
    remote=None,
    all_branches=False,
    with_deps=False,
    all_tags=False,
    recursive=False,
    all_commits=False,
    revs=None,
    workspace=True,
    max_size=None,
    types=None,
    config=None,
    onerror=None,
    push=False,
) -> dict[str, "IndexView"]:

def index_from_targets(
    repo: "Repo",
    targets: Optional["TargetType"] = None,
    stage_filter: Optional[Callable[["Stage"], bool]] = None,
    outs_filter: Optional[Callable[["Output"], bool]] = None,
    max_size: Optional[int] = None,
    types: Optional[list[str]] = None,
    with_deps: bool = False,
    recursive: bool = False,
    **kwargs: Any,
) -> "IndexView":

Import

from dvc.repo.fetch import _collect_indexes
from dvc.repo.index import index_from_targets

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| repo | Repo | Yes | The DVC repository instance providing access to config, brancher, and the data index. |
| targets | Optional[list] | No | Specific DVC files or output paths to include. If None, all tracked outputs are included. |
| remote | Optional[str] | No | Remote name to filter by; outputs assigned to a different remote are excluded. |
| all_branches | bool | No | If True, collect indexes for all branches. |
| all_tags | bool | No | If True, collect indexes for all tags. |
| all_commits | bool | No | If True, collect indexes for all commits. |
| revs | Optional[list] | No | Explicit list of revision identifiers to iterate over. |
| with_deps | bool | No | If True, include stages that are dependencies of targeted stages. |
| recursive | bool | No | If True, recursively match targets within directories. |
| workspace | bool | No | If True (default), include the current workspace in the revision set. |
| max_size | Optional[int] | No | Exclude outputs larger than this byte size. |
| types | Optional[list[str]] | No | Restrict to specific output types: "metrics", "plots", "params". |
| config | Optional[dict] | No | Additional config overrides to merge before collection. |
| onerror | Optional[Callable] | No | Error callback invoked with (rev, entry, exc) on collection failure. |
| push | bool | No | If True, excludes repo imports and non-pushable outputs. |
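To make the filtering parameters more concrete, here is a minimal sketch of how remote, max_size, and push could combine into a single output predicate, together with an onerror-style callback. FakeOutput, make_outs_filter, and record_error are hypothetical names that do not exist in DVC. The exact size comparison and the treatment of outputs with no assigned remote are assumptions, not confirmed library behaviour.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class FakeOutput:
    """Hypothetical stand-in; the real DVC Output class is richer."""
    path: str
    size: int
    remote: Optional[str] = None  # None: no remote pinned to this output
    can_push: bool = True


def make_outs_filter(
    remote: Optional[str] = None,
    max_size: Optional[int] = None,
    push: bool = False,
) -> Callable[[FakeOutput], bool]:
    """Combine the remote, max_size, and push rules into one predicate."""

    def keep(out: FakeOutput) -> bool:
        if push and not out.can_push:
            return False  # non-pushable outputs are skipped when pushing
        if remote and out.remote and out.remote != remote:
            return False  # output is assigned to a different remote
        if max_size is not None and out.size > max_size:
            return False  # output exceeds the size limit (assumed strict >)
        return True

    return keep


errors: List[Tuple[str, str]] = []


def record_error(rev, entry, exc):
    """onerror-style callback: remember which revision failed and why."""
    errors.append((rev, type(exc).__name__))


outs = [
    FakeOutput("data/train.csv", size=10),
    FakeOutput("data/huge.bin", size=10_000),
    FakeOutput("models/model.pkl", size=50, remote="other"),
    FakeOutput("cache/tmp", size=5, can_push=False),
]
keep = make_outs_filter(remote="myremote", max_size=1_000, push=True)
kept = [o.path for o in outs if keep(o)]
print(kept)

# Simulate a collection failure at one revision and report it via the callback.
try:
    raise KeyError("missing stage")
except KeyError as exc:
    record_error("v1.0", None, exc)
print(errors)
```

Building the predicate once and reusing it for every revision keeps the per-revision loop free of parameter handling, which matches the split between stage-level and output-level filters described earlier.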

Outputs

| Name | Type | Description |
|------|------|-------------|
| return | dict[str, IndexView] | A dictionary mapping revision identifiers (e.g., "workspace", branch names, commit SHAs) to filtered IndexView objects. Each IndexView contains the filtered stages, outputs, data keys, and storage maps needed for transfer. |
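Because the result is keyed by revision, a caller often wants to merge the per-revision contents so that an object shared by several revisions is considered only once. The sketch below shows that idea with plain sets of key tuples standing in for the data keys of each view; merge_keys and revisions_containing are hypothetical helpers, not DVC functions.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[str, ...]  # a path split into its parts, e.g. ("data", "train.csv")


def merge_keys(keys_by_rev: Dict[str, Set[Key]]) -> Set[Key]:
    """Union the keys of all revisions so duplicates collapse to one entry."""
    merged: Set[Key] = set()
    for keys in keys_by_rev.values():
        merged |= keys
    return merged


def revisions_containing(keys_by_rev: Dict[str, Set[Key]], key: Key) -> List[str]:
    """List the revisions (sorted by name) whose view includes the given key."""
    return sorted(rev for rev, keys in keys_by_rev.items() if key in keys)


# Toy stand-in for the {revision: view} mapping described above.
keys_by_rev = {
    "workspace": {("data", "train.csv"), ("models", "model.pkl")},
    "main": {("data", "train.csv")},
    "v1.0": {("data", "train.csv"), ("data", "test.csv")},
}

all_keys = merge_keys(keys_by_rev)
paths = sorted("/".join(k) for k in all_keys)
print(paths)
print(revisions_containing(keys_by_rev, ("data", "test.csv")))
```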

Usage Examples

Basic Usage

from dvc.repo import Repo
from dvc.repo.fetch import _collect_indexes

repo = Repo()

# Collect indexes for the current workspace only
indexes = _collect_indexes(repo)
for rev, idx in indexes.items():
    print(f"Revision: {rev}")
    for out in idx.outs:
        print(f"  Output: {out}")

# Collect indexes across all branches for a push operation
indexes = _collect_indexes(
    repo,
    all_branches=True,
    push=True,
    remote="myremote",
)
print(f"Collected {len(indexes)} revision indexes")

Using index_from_targets Directly

from dvc.repo import Repo
from dvc.repo.index import index_from_targets

repo = Repo()

# Get a filtered view for specific targets
view = index_from_targets(
    repo,
    targets=["data/train.csv", "models/"],
    recursive=True,
)

# Inspect what would be transferred
for key_set in view.data_keys.values():
    for key in key_set:
        print(f"  Data key: {'/'.join(key)}")
