Principle:Huggingface Open r1 High Concurrency Inference
Overview
An inference scaling technique that maximizes throughput for large-scale text generation by using asynchronous HTTP requests, chunked dataset processing, and resumable execution against vLLM-compatible API servers.
Description
When generating reasoning traces at scale (millions of examples), single-threaded synchronous generation is too slow. This principle addresses throughput by:
- Async HTTP clients -- using aiohttp to maintain thousands of concurrent requests to a vLLM server
- Chunked processing -- iterating over the dataset in chunks to bound memory use
- Resumable output -- JSONL-based output with deduplication by UUID column to skip already-processed examples
- Retry budgets -- handling transient failures with configurable retry limits
- High-performance event loop -- uvloop for faster async I/O
The approach complements the Distilabel pipeline method by offering finer control over concurrency and resumability. This is the approach used by scripts/generate_reasoning.py for high-throughput generation.
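The core of the technique is semaphore-bounded concurrency with a per-request retry budget. The sketch below shows that pattern using only the standard library; the call_model stub stands in for the real aiohttp POST to a vLLM-compatible endpoint, and the function names and backoff policy are illustrative assumptions, not the script's actual API.

```python
import asyncio

async def call_model(prompt: str) -> str:
    # Stand-in for an aiohttp POST to a vLLM-compatible completions endpoint.
    await asyncio.sleep(0)
    return f"completion for: {prompt}"

async def generate_with_retry(semaphore, prompt, retry_budget=3):
    # The semaphore bounds in-flight requests; retries absorb transient failures.
    for attempt in range(retry_budget):
        try:
            async with semaphore:
                return await call_model(prompt)
        except Exception:
            # Simple exponential backoff between attempts (illustrative choice).
            await asyncio.sleep(2 ** attempt)
    return None  # retry budget exhausted; caller can log and skip this example

async def main(prompts, max_concurrent=2):
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [generate_with_retry(semaphore, p) for p in prompts]
    # gather preserves input order, so outputs line up with prompts.
    return await asyncio.gather(*tasks)

results = asyncio.run(main(["a", "b", "c"]))
```

In the real script the semaphore cap and retry budget are the main throughput knobs: the cap should sit just below the vLLM server's saturation point, and failed examples that exhaust the budget are dropped rather than crashing the run.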
Usage
Use when generating at scale (>100k examples) where fine-grained concurrency control and resumability are critical. For smaller-scale generation, the Distilabel pipeline approach (Principle:Huggingface_Open_r1_Synthetic_Data_Generation) is simpler.
Theoretical Basis
The async concurrent generation pattern works by loading previously completed UUIDs from the output file, then processing the dataset in chunks with semaphore-bounded concurrency:
processed_uuids = load_existing_output(output_file)
semaphore = Semaphore(max_concurrent)
for chunk in dataset.iter(chunk_size):
    tasks = []
    for example in chunk:
        if example.uuid in processed_uuids:
            continue
        task = generate_with_retry(session, example, retry_budget=10)
        tasks.append(task)
    results = await gather(*tasks)
    write_jsonl(output_file, results)
The pipeline begins by scanning the output JSONL file to collect UUIDs of already-processed examples, enabling resumability across interrupted runs. The semaphore bounds the number of in-flight requests to prevent overwhelming the vLLM server. Each chunk of the dataset is iterated over, skipping already-processed examples, and remaining examples are dispatched as async tasks with a retry budget to handle transient API failures. Results are gathered and appended to the JSONL output file after each chunk completes.
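The resumability half of the pattern (scan the JSONL output for completed UUIDs, skip them on restart, append new results) can be sketched as below. The helper names, the "uuid" field name, and the demo data are assumptions for illustration, not the exact schema used by the script.

```python
import json
import os
import tempfile

def load_existing_uuids(path):
    # Scan the output JSONL and collect UUIDs of rows already written.
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                done.add(json.loads(line)["uuid"])
    return done

def append_jsonl(path, rows):
    # Append-only writes keep completed work intact across interruptions.
    with open(path, "a") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

# Demo: a five-example dataset and a temporary output file.
path = os.path.join(tempfile.mkdtemp(), "generations.jsonl")
dataset = [{"uuid": str(i), "prompt": f"p{i}"} for i in range(5)]

# First "run": three examples complete before a simulated interruption.
append_jsonl(path, [{"uuid": ex["uuid"], "output": "gen-" + ex["uuid"]}
                    for ex in dataset[:3]])

# Second "run": resume by skipping UUIDs already present in the output file.
done = load_existing_uuids(path)
remaining = [ex for ex in dataset if ex["uuid"] not in done]
```

Because the output is append-only and deduplicated by UUID at startup, an interrupted job can be relaunched with the same command and will only pay for the unfinished examples.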