Principle:Wandb Weave Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A data structuring pattern that organizes evaluation examples into a versioned, iterable collection with schema consistency.
Description
Dataset Preparation transforms raw data (lists of dicts, DataFrames, Hugging Face datasets) into a standardized, versioned collection suitable for systematic evaluation. The prepared dataset enforces a consistent column schema across all rows, supports iteration and indexing, and integrates with the versioning system so that experiments can be reproduced against the exact same data.
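The normalization step can be illustrated with a small standalone sketch. It is not Weave's implementation; the helper name `to_rows` is hypothetical, and it only handles two input shapes: a column-oriented mapping and an iterable of row mappings. For a pandas DataFrame, `df.to_dict(orient="records")` yields row dicts that could be passed to the same helper.

```python
from collections.abc import Mapping
from typing import Any


def to_rows(source: Any) -> list[dict[str, Any]]:
    """Normalize a raw source into a list of row dicts (hypothetical helper)."""
    # Column-oriented mapping, e.g. {"question": [...], "expected": [...]}
    if isinstance(source, Mapping):
        columns = list(source.keys())
        lengths = {len(source[c]) for c in columns}
        if len(lengths) > 1:
            raise ValueError("columns have different lengths")
        n = lengths.pop() if lengths else 0
        return [{c: source[c][i] for c in columns} for i in range(n)]
    # Row-oriented iterable of mappings: copy each row into a plain dict
    return [dict(row) for row in source]


by_column = {"question": ["1+1", "2+2"], "expected": ["2", "4"]}
by_row = [
    {"question": "1+1", "expected": "2"},
    {"question": "2+2", "expected": "4"},
]
# Both shapes normalize to the same row-oriented representation.
normalized_from_columns = to_rows(by_column)
normalized_from_rows = to_rows(by_row)
```

Whatever the source, downstream code then only has to deal with one shape: a list of dictionaries.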
Usage
Use this principle when assembling test examples for model evaluation. The dataset defines the ground truth that models are evaluated against and must be prepared before running any evaluation pipeline.
Theoretical Basis
Evaluation datasets follow the tabular data model:
- Schema Definition: Each row is a dictionary with consistent keys (columns).
- Type Coercion: Input data from various sources is normalized to a common internal representation (Table).
- Versioning: Content-addressable storage identifies each published version by its content, so a published dataset is immutable and any change to the rows produces a new version.
- Iteration: The dataset supports sequential access for batch evaluation.
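The properties listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how Weave stores or versions data: the class name `SketchDataset` is hypothetical, and the content address is assumed to be a SHA-256 digest of the rows serialized as canonical JSON.

```python
import hashlib
import json
from typing import Any, Iterable, Iterator, Mapping


class SketchDataset:
    """Toy tabular dataset: consistent schema, content digest, iteration."""

    def __init__(self, rows: Iterable[Mapping[str, Any]]) -> None:
        copied = [dict(r) for r in rows]
        if not copied:
            raise ValueError("dataset must contain at least one row")
        expected = set(copied[0])
        # Schema definition: every row must expose exactly the same keys.
        for i, row in enumerate(copied):
            if set(row) != expected:
                raise ValueError(
                    f"row {i} has keys {sorted(row)}, expected {sorted(expected)}"
                )
        self._rows = copied
        self.columns = sorted(expected)

    def digest(self) -> str:
        # Assumed content address: hash of a key-order-independent JSON dump.
        payload = json.dumps(self._rows, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> dict[str, Any]:
        # Return a copy so callers cannot mutate the stored rows.
        return dict(self._rows[index])

    def __iter__(self) -> Iterator[dict[str, Any]]:
        return (dict(r) for r in self._rows)


ds = SketchDataset([
    {"question": "1+1", "expected": "2"},
    {"question": "2+2", "expected": "4"},
])
```

Because the digest depends only on content, two datasets with identical rows share a version identifier, while editing a single value yields a different one.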
A well-prepared dataset separates input features (passed to the model) from ground truth labels (passed to scorers), enabling clean model-scorer composition.
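The separation of inputs from labels might look like the following sketch. All names here (`split_row`, `toy_model`, `exact_match`, and the `question` and `expected` columns) are hypothetical and chosen only for illustration; a real evaluation framework may route columns to models and scorers differently.

```python
from typing import Any


def split_row(
    row: dict[str, Any], label_keys: set[str]
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split one row into model inputs and ground-truth labels."""
    inputs = {k: v for k, v in row.items() if k not in label_keys}
    labels = {k: v for k, v in row.items() if k in label_keys}
    return inputs, labels


def toy_model(question: str) -> str:
    # Stand-in model: only knows one answer.
    return "4" if question == "2+2" else "unknown"


def exact_match(output: str, expected: str) -> dict[str, bool]:
    # Stand-in scorer: compares the model output with the label.
    return {"correct": output == expected}


def evaluate(
    rows: list[dict[str, Any]], label_keys: set[str]
) -> list[dict[str, bool]]:
    results = []
    for row in rows:
        inputs, labels = split_row(row, label_keys)
        output = toy_model(**inputs)  # the model never sees the labels
        results.append(exact_match(output, **labels))
    return results


rows = [
    {"question": "2+2", "expected": "4"},
    {"question": "3+3", "expected": "6"},
]
results = evaluate(rows, label_keys={"expected"})
```

Keeping the label columns out of the model call makes it harder for ground truth to leak into predictions, and lets the same dataset be reused with different models or scorers.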