Heuristic: ML.NET Text File Sampling Strategy
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Optimization |
| Last Updated | 2026-02-09 11:00 GMT |
Overview
Multi-chunk file sampling strategy with 4MB buffer, 10% oversampling, and 98% column uniformity threshold for robust automatic schema inference.
Description
ML.NET's AutoML column inference uses a sophisticated multi-chunk sampling strategy to infer data file schemas without reading entire files. It reads a 1MB initial chunk to estimate line lengths, then samples from multiple positions (beginning, middle, end) within a 4MB buffer. A 10% oversampling rate ensures coverage of edge cases, and a 98% column uniformity threshold tolerates minor formatting inconsistencies while rejecting truly malformed files.
Usage
Use this heuristic when loading text data files with AutoML or implementing custom data loading logic. The sampling parameters provide good defaults for most CSV/TSV files. Consider adjusting the column uniformity threshold for datasets known to have legitimate formatting variations.
The Insight (Rule of Thumb)
File Sampling:
- Action: Use multi-chunk sampling instead of reading the entire file for schema inference.
- Values: `BufferSizeMb=4`, `FirstChunkSizeMb=1`, `LinesPerChunk=20`, `OversamplingRate=1.1`
- Trade-off: Fast inference (at most ~4MB read) at the cost of potentially missing anomalies that fall between sampled chunks. Oversampling mitigates this.
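The chunk-planning arithmetic can be sketched as follows. This is a simplified Python illustration of the strategy described above, not the actual `TextFileSample.cs` implementation; the helper name and the even-stride placement are assumptions for the sketch.

```python
MB = 1024 * 1024
BUFFER_SIZE_MB = 4        # total sampling budget
FIRST_CHUNK_SIZE_MB = 1   # initial chunk used to estimate line length
LINES_PER_CHUNK = 20      # target lines per sampled chunk
OVERSAMPLING_RATE = 1.1   # read 10% extra to absorb line-length variance

def plan_chunk_offsets(file_size, avg_line_length):
    """Spread the remaining ~3MB budget evenly across the unread file tail."""
    # Each chunk is sized to hold ~20 lines, inflated by 10% oversampling.
    chunk_size = int(avg_line_length * LINES_PER_CHUNK * OVERSAMPLING_RATE)
    remaining_budget = (BUFFER_SIZE_MB - FIRST_CHUNK_SIZE_MB) * MB
    chunk_count = max(1, remaining_budget // chunk_size)
    start = FIRST_CHUNK_SIZE_MB * MB  # the first 1MB was already read
    stride = max(chunk_size, (file_size - start) // chunk_count)
    offsets = []
    pos = start
    while pos + chunk_size <= file_size and len(offsets) < chunk_count:
        offsets.append(pos)
        pos += stride
    return chunk_size, offsets
```

For a 100MB file with a 100-byte average line, each chunk is 100 × 20 × 1.1 = 2,200 bytes, and the 3MB budget allows roughly 1,430 chunks spread across the remaining 99MB.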
Column Uniformity:
- Action: Require 98% of sampled rows to have the same column count.
- Value: `UniformColumnCountThreshold = 0.98`
- Trade-off: Tolerates up to 2% ragged rows (common in real-world data) while catching truly malformed files.
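A minimal sketch of the uniformity check (hypothetical helper, not the ML.NET API):

```python
from collections import Counter

UNIFORM_COLUMN_COUNT_THRESHOLD = 0.98

def has_uniform_columns(rows, separator=","):
    """Return True if >=98% of sampled rows share the most common column count."""
    if not rows:
        return False
    counts = Counter(row.count(separator) + 1 for row in rows)
    _, occurrences = counts.most_common(1)[0]
    return occurrences / len(rows) >= UNIFORM_COLUMN_COUNT_THRESHOLD
```

Out of 1,000 sampled rows, up to 20 may deviate from the dominant column count before the file is rejected.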
Row Sampling for Inference:
- Action: Read at most 1,000 rows for column type and purpose inference.
- Value: `MaxRowsToRead = 1000`
- Trade-off: Beyond 1,000 rows, type inference accuracy has diminishing returns. Saves time on large files.
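A simplified sketch of capped type inference; the type names mirror ML.NET conventions, but the detection logic here is illustrative, not the actual `ColumnTypeInference` code:

```python
MAX_ROWS_TO_READ = 1000

def infer_column_type(values):
    """Infer a coarse column type from at most the first 1,000 values."""
    sample = values[:MAX_ROWS_TO_READ]  # cap the scan; more rows add little
    if not sample:
        return "String"
    if all(v.lower() in ("true", "false") for v in sample):
        return "Boolean"
    try:
        for v in sample:
            float(v)  # every sampled value parses as numeric
        return "Single"
    except ValueError:
        return "String"
```

Because only the first 1,000 values are scanned, inference time is bounded even for multi-gigabyte columns.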
Column Count Threshold:
- Action: Switch inference strategy above 10,000 columns.
- Value: `SmartColumnsLim = 10000`
- Trade-off: Detailed per-column inference is too slow above 10K columns; use simplified strategy.
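The dispatch itself is simple; sketched here with a hypothetical helper name, and the behavior at exactly 10,000 columns is an assumption of this sketch:

```python
SMART_COLUMNS_LIM = 10000

def pick_inference_strategy(column_count):
    """Per-column ('smart') inference below the limit, simplified above it."""
    return "per-column" if column_count < SMART_COLUMNS_LIM else "simplified"
```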
Reasoning
The multi-chunk strategy ensures representative sampling across the file. The first 1MB chunk determines average line length, which then guides how to distribute the remaining 3MB budget across the file. Oversampling by 10% accounts for variance in line lengths and ensures the buffer captures enough data.
The 98% column uniformity threshold was chosen to handle common real-world issues (trailing commas, occasional blank lines) while still detecting incorrectly formatted files. A threshold much lower than 98% would accept genuinely broken files; much higher would reject files with minor formatting quirks.
The 1,000-row limit for type inference is based on the observation that column type distributions (numeric, text, boolean) stabilize well before 1,000 rows in typical datasets. Reading more rows adds time without improving accuracy.
Code Evidence
Sampling constants from `src/Microsoft.ML.AutoML/ColumnInference/TextFileSample.cs:23-26`:
```csharp
private const int BufferSizeMb = 4;          // Total buffer size in MB
private const int FirstChunkSizeMb = 1;      // Initial exploration chunk
private const int LinesPerChunk = 20;        // Sample lines per chunk
private const Double OversamplingRate = 1.1; // 10% oversampling for robustness
```
Column uniformity from `src/Microsoft.ML.AutoML/ColumnInference/TextFileContents.cs:40`:
```csharp
private const Double UniformColumnCountThreshold = 0.98;
```
Max rows from `src/Microsoft.ML.AutoML/ColumnInference/PurposeInference.cs:18`:
```csharp
public const int MaxRowsToRead = 1000;
```
Smart columns limit from `src/Microsoft.ML.AutoML/ColumnInference/ColumnTypeInference.cs:23`:
```csharp
private const int SmartColumnsLim = 10000;
```
Graceful fallback from `src/Microsoft.ML.AutoML/ColumnInference/TextFileContents.cs:122`:
```csharp
// fail gracefully if unable to instantiate data view with swept arguments
```