Principle:Mlfoundations Open flamingo Data Quality Filtering
Metadata
| Field | Value |
|---|---|
| Sources | Paper: LAION-5B https://arxiv.org/abs/2210.08402 |
| Domains | Data_Preparation, Preprocessing, Data_Quality |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
A data validation pattern that filters malformed or incomplete samples out of web-scraped datasets to ensure training data quality.
Description
Web-scraped datasets like LAION frequently contain samples with missing images (download failures), missing captions (empty text fields), or corrupted data. Filtering out these incomplete samples before training prevents NaN losses, decoder errors, and wasted computation. The filtering is applied as a streaming operation within the WebDataset pipeline, checking each sample for required fields before passing it to the preprocessing stage.
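The required-field check described above can be sketched as a small predicate. This is a minimal illustration, not necessarily the project's exact code: the function name `has_caption_and_image`, the `IMAGE_KEYS` tuple, and the example samples are assumptions made for this sketch. It assumes each sample is a dict keyed by file extension, which is how WebDataset exposes the members of a tar shard.

```python
# Assumed set of image extensions; adjust to match the shards actually in use.
IMAGE_KEYS = ("jpg", "jpeg", "png")


def has_caption_and_image(sample: dict) -> bool:
    """Return True only if the sample carries a caption and at least one image field."""
    has_caption = "txt" in sample
    has_image = any(key in sample for key in IMAGE_KEYS)
    return has_caption and has_image


# Hypothetical raw samples, shaped like undecoded WebDataset entries.
samples = [
    {"__key__": "000", "jpg": b"\xff\xd8", "txt": b"a dog"},  # complete
    {"__key__": "001", "txt": b"caption only"},               # image download failed
    {"__key__": "002", "png": b"\x89PNG"},                    # caption missing
]

kept = [s["__key__"] for s in samples if has_caption_and_image(s)]
print(kept)
```

Note that this sketch only checks that the fields are present. Rejecting captions that exist but are empty would need an additional check on the field's contents.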
Usage
Use this pattern when loading web-scraped image-text datasets for training. It is applied as a pipeline filter within the data loading stage.
Theoretical Basis
Training on incomplete data can cause numerical instability (for example, NaN losses when an input is missing) or semantic noise (the model learns from empty captions). Streaming filters (applied via wds.select()) operate on individual samples within the WebDataset pipeline without loading the entire dataset into memory. The filter is applied after tar extraction but before image decoding, so invalid samples are rejected early and no decode computation is spent on them.
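The ordering argument can be illustrated with plain Python generators. This is a hedged sketch that does not use WebDataset itself: `select`, `decode`, `is_complete`, and the `stats` counter are stand-ins written for this example. In a real pipeline the corresponding stages would be `wds.select(predicate)` placed before `wds.decode(...)`.

```python
from typing import Callable, Iterable, Iterator

stats = {"decoded": 0}  # counts how many samples reach the (stand-in) decoder


def select(source: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Lazily yield only the samples for which predicate returns True."""
    for sample in source:
        if predicate(sample):
            yield sample


def decode(source: Iterable[dict]) -> Iterator[dict]:
    """Stand-in for an expensive image decode step."""
    for sample in source:
        stats["decoded"] += 1
        yield {**sample, "decoded": True}


def is_complete(sample: dict) -> bool:
    """Require a caption and at least one image field (assumed extensions)."""
    return "txt" in sample and any(k in sample for k in ("jpg", "jpeg", "png"))


# Hypothetical stream of raw samples; an iterator, so nothing is held in memory at once.
raw = iter([
    {"__key__": "a", "jpg": b"...", "txt": b"a cat"},
    {"__key__": "b", "txt": b"no image"},
    {"__key__": "c", "jpg": b"..."},
    {"__key__": "d", "png": b"...", "txt": b"a tree"},
])

# Filter first, decode second: rejected samples never reach the decoder.
results = list(decode(select(raw, is_complete)))
print([r["__key__"] for r in results], stats["decoded"])
```

Because both stages are generators, each sample flows through the filter and then the decoder one at a time, and the decoder is invoked only for samples that passed the filter.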