Implementation:Evidentlyai Evidently Legacy Data Loader
| Knowledge Sources | |
|---|---|
| Domains | ML Monitoring, Data Loading, Data Pipeline |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
DataLoader loads CSV files into pandas DataFrames for the Evidently legacy pipeline, with configurable row sampling (none, every nth row, or random).
Description
This module defines a data loading subsystem with three public components, one internal sampling class, and two internal helper functions:
SamplingOptions (dataclass) -- configures how rows are sampled during loading:
- type -- sampling strategy: "none" (load all), "nth" (every nth row), or "random" (random sampling). Default: "none".
- random_seed -- seed for reproducible random sampling. Default: 1.
- ratio -- probability ratio for random sampling (0.0 to 1.0). Default: 1.0.
- n -- interval for nth-row sampling. Default: 1.
DataOptions (dataclass) -- configures CSV parsing:
- date_column -- column name to parse as datetime. Default: "datetime".
- separator -- CSV field separator. Default: ",".
- header -- whether the CSV has a header row. Default: True.
- column_names -- explicit column names, or None to infer from data.
DataLoader -- the main class with a single load method that:
- Reads a CSV file using pd.read_csv
- Applies a skiprows function based on the sampling options
- Parses the date column if specified
- Handles the header row based on DataOptions.header
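The steps above can be sketched as a single pd.read_csv call. This is a minimal illustration rather than the module's actual source: the function name load_csv_sketch and the exact mapping of options to pandas keyword arguments are assumptions made for this example.

```python
import io
from typing import Callable, Optional

import pandas as pd


def load_csv_sketch(
    source,
    date_column: Optional[str] = "datetime",
    separator: str = ",",
    header: bool = True,
    skiprows: Optional[Callable[[int], bool]] = None,
) -> pd.DataFrame:
    """Hypothetical helper: maps the documented options onto pd.read_csv."""
    return pd.read_csv(
        source,
        sep=separator,
        # header=0 uses the first row as column names; None means no header row.
        header=0 if header else None,
        # A callable receives each row index and returns True to skip that row.
        skiprows=skiprows,
        # Parse the date column only when one is configured.
        parse_dates=[date_column] if date_column else False,
    )


# Small in-memory CSV so the sketch runs without any file on disk.
csv_text = "datetime,value\n2024-01-01,1\n2024-01-02,2\n2024-01-03,3\n"
df = load_csv_sketch(io.StringIO(csv_text))
```

Passing date_column=None in this sketch disables date parsing, which mirrors the "if specified" condition in the list above.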
RandomizedSkipRows -- an internal class that implements chunk-based random row selection. It generates random boolean arrays in chunks of CHUNK_SIZE (1000) rows for memory-efficient random sampling.
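To make the chunking idea concrete, here is a rough sketch of a stateful skip callable that draws its keep/skip decisions one chunk at a time. It is not the library's implementation: the class name ChunkedRandomSkip, the use of the standard-library random module, and the decision to always keep row 0 are assumptions for illustration. It also assumes the callable is invoked once per row, in order.

```python
import random
from typing import List

CHUNK_SIZE = 1000  # chunk length documented for the module


class ChunkedRandomSkip:
    """Hypothetical stand-in for RandomizedSkipRows, written with the stdlib only."""

    def __init__(self, ratio: float, random_seed: int) -> None:
        self._rng = random.Random(random_seed)
        self._ratio = ratio
        self._flags: List[bool] = self._draw_chunk()
        self._seen = 0  # number of data rows examined so far

    def _draw_chunk(self) -> List[bool]:
        # True means "skip this row"; each row is kept with probability `ratio`.
        return [self._rng.random() >= self._ratio for _ in range(CHUNK_SIZE)]

    def skiprows(self, row_index: int) -> bool:
        if row_index == 0:
            return False  # assumption: row 0 is the header and is always kept
        position = self._seen % CHUNK_SIZE
        if self._seen > 0 and position == 0:
            self._flags = self._draw_chunk()  # refill once a chunk is used up
        self._seen += 1
        return self._flags[position]


skipper = ChunkedRandomSkip(ratio=0.5, random_seed=42)
kept_rows = [i for i in range(1, 10_001) if not skipper.skiprows(i)]
```

Only CHUNK_SIZE flags are held in memory at any time, regardless of how many rows the file contains, and the fixed seed makes the selection repeatable across runs.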
Internal helper functions:
- _skiprows(sampling_options) -- resolves the sampling type to a callable skip function or None
- __simple(sampling_options) -- creates a skip function for nth-row sampling (keeps rows where row_idx % n == 1)
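The nth-row rule described above can be illustrated with a small closure. This is a sketch under stated assumptions, not the module's code: make_nth_row_skip is a hypothetical name, keeping row 0 (the header) is an assumption, and comparing against 1 % n so that n=1 keeps every row is a choice made for this example.

```python
from typing import Callable


def make_nth_row_skip(n: int) -> Callable[[int], bool]:
    """Hypothetical helper: keep the header plus rows where row_idx % n == 1."""

    def skip(row_idx: int) -> bool:
        if row_idx == 0:
            return False  # assumption: row 0 is the header and is never skipped
        # 1 % n equals 1 for n > 1 and 0 for n == 1, so n == 1 keeps all rows.
        return row_idx % n != 1 % n

    return skip


skip_fn = make_nth_row_skip(5)
kept = [i for i in range(12) if not skip_fn(i)]
print(kept)  # [0, 1, 6, 11]
```

A callable of this shape is what pd.read_csv accepts for its skiprows argument: it is evaluated against each row index and returns True for rows to skip.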
Usage
Use DataLoader when loading CSV data files for Evidently analysis, particularly when you need to sample large datasets for faster processing or development iteration.
Code Reference
Source Location
- Repository: Evidentlyai_Evidently
- File: src/evidently/legacy/runner/loader.py
Signature
@dataclasses.dataclass
class SamplingOptions:
    type: str = "none"
    random_seed: int = 1
    ratio: float = 1.0
    n: int = 1

@dataclasses.dataclass
class DataOptions:
    date_column: str
    separator: str
    header: bool
    column_names: Optional[List[str]]

    def __init__(self, date_column="datetime", separator=",", header=True, column_names=None):
        ...

class DataLoader:
    def __init__(self): ...

    def load(
        self,
        filename: str,
        data_options: DataOptions,
        sampling_options: SamplingOptions = None,
    ) -> pd.DataFrame: ...

CHUNK_SIZE = 1000

class RandomizedSkipRows:
    def __init__(self, ratio: float, random_seed: int): ...
    def skiprows(self, row_index: int) -> bool: ...
Import
from evidently.legacy.runner.loader import DataLoader
from evidently.legacy.runner.loader import DataOptions
from evidently.legacy.runner.loader import SamplingOptions
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| filename | str | Yes | Path to the CSV file to load. |
| data_options | DataOptions | Yes | CSV parsing configuration (date column, separator, header, column names). |
| sampling_options | SamplingOptions | No | Row sampling configuration. Defaults to no sampling. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | pd.DataFrame | A pandas DataFrame containing the loaded and optionally sampled data. |
Usage Examples
from evidently.legacy.runner.loader import DataLoader, DataOptions, SamplingOptions
loader = DataLoader()
# Load entire CSV with default options
data_options = DataOptions(date_column="datetime", separator=",")
df = loader.load("data/train.csv", data_options)
# Load with nth-row sampling (every 5th row)
sampling = SamplingOptions(type="nth", n=5)
df_sampled = loader.load("data/train.csv", data_options, sampling_options=sampling)
# Load with random sampling (50% of rows)
sampling = SamplingOptions(type="random", ratio=0.5, random_seed=42)
df_random = loader.load("data/train.csv", data_options, sampling_options=sampling)
# Load a CSV without header and with custom separator
data_options = DataOptions(
    date_column=None,
    separator="\t",
    header=False,
    column_names=["col1", "col2", "col3"],
)
df = loader.load("data/raw.tsv", data_options)