Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Turboderp org Exllamav2 Bulk Dataset Inference

From Leeroopedia
Revision as of 11:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Turboderp_org_Exllamav2_Bulk_Dataset_Inference.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Knowledge Sources
Domains LLMs, Inference, Batch_Processing, Data_Engineering
Last Updated 2026-02-15 00:00 GMT

Overview

End-to-end process for processing a large dataset of prompts through a quantized language model using ExLlamaV2's Dynamic Generator with concurrent batching and job queue management.

Description

This workflow demonstrates high-throughput batch inference over a dataset of prompts using the Dynamic Generator's job queue system. Rather than processing prompts one at a time, all prompts are enqueued as jobs with the generator automatically managing concurrent execution, cache allocation, and scheduling. The generator maintains the maximum possible batch size within VRAM constraints, activating new jobs as completed ones free cache pages. Results are collected asynchronously as each job completes, with full completion text available in the EOS result.

Usage

Execute this workflow when you have a large collection of prompts (hundreds to tens of thousands) that need to be processed through a language model for tasks such as dataset augmentation, evaluation benchmarks, synthetic data generation, or batch question answering. The Dynamic Generator's concurrent batching maximizes GPU utilization by overlapping prefill and generation across multiple sequences.

Execution Steps

Step 1: Model_And_Generator_Setup

Initialize the model configuration, create a large KV-cache (scaled to available VRAM), load the model with auto-split, and create the Dynamic Generator with elevated batch size and queue limits. The cache size should be maximized for throughput since it determines how many concurrent jobs can run. The generator's max_batch_size controls the upper bound on concurrent sequences.

Key considerations:

  • Cache size should be as large as VRAM allows (e.g., 100K tokens for 24 GB GPU with 8B model)
  • max_batch_size limits concurrent generation (e.g., 1024)
  • max_q_size controls how many jobs can queue before the generator paces itself
  • The model should be loaded with auto-split for optimal GPU utilization

Step 2: Dataset_Loading

Load the prompt dataset from a HuggingFace dataset, JSON file, or other source. Extract the prompt text from each row and prepare it for formatting. The dataset can contain thousands of entries; the generator's queue handles scheduling them efficiently.

Key considerations:

  • Datasets can be loaded via HuggingFace datasets library or from local files
  • Only the prompt column needs to be extracted
  • Row count can be limited for testing before running full dataset

Step 3: Prompt_Formatting

Format each raw prompt using the appropriate instruct template for the model family (e.g., Llama3, ChatML). Apply the system prompt, wrap user content in the correct delimiters, and add the assistant turn prefix. Encode each formatted prompt into token IDs with special token encoding enabled.

Key considerations:

  • Prompt format must match the model's training template
  • System prompts are included in the formatted prompt
  • Special tokens must be encoded as tokens, not text
  • Encoding is done upfront before enqueueing

Step 4: Job_Enqueueing

Create a DynamicJob for each formatted prompt with configured sampling settings, maximum response length, stop conditions, and a unique identifier linking back to the original dataset row. Enqueue all jobs with the generator. The generator immediately begins processing jobs as they are enqueued, up to the cache and batch size limits.

Key considerations:

  • Each job gets a unique identifier for result tracking
  • Sampling settings and stop conditions can be shared or per-job
  • The generator begins processing as soon as jobs are enqueued
  • Thousands of jobs can be queued without memory issues (only active jobs use VRAM)

Step 5: Result_Collection

Run the generator's iterate loop until all jobs complete. Each iteration returns results for active jobs, potentially including text chunks (during streaming) and EOS signals (when a job finishes). On EOS, the result includes the full completion text, which is stored indexed by the job's identifier. Track throughput metrics (completions per minute, tokens per second) during processing.

Key considerations:

  • The iterate loop processes all active jobs in parallel
  • EOS results contain full_completion with the complete response text
  • Batch size varies dynamically as jobs start and finish
  • Progress metrics can be computed from timestamps and completion counts

Step 6: Output_Persistence

Write the collected completions to an output file (JSON, JSONL, or other format). The completions array is indexed by the original dataset row order, preserving the mapping between input prompts and generated responses.

Key considerations:

  • Output should preserve the original dataset ordering
  • JSON is the simplest format for structured output
  • Large datasets may benefit from streaming writes (JSONL)
  • Completions can be post-processed or filtered before saving

Execution Diagram

GitHub URL

Workflow Repository