Implementation: FlagOpen FlagEmbedding Reinforced IR Multi-GPU
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, LLM Inference, Data Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Multi-GPU parallel text generation system for scaling LLM-based data generation in the Reinforced IR pipeline.
Description
This module implements a multi-GPU parallel processing system for generating synthetic data at scale with large language models. It uses Python multiprocessing to spawn one process per GPU, each handling its own split of the input data independently. The system reads JSON files containing prompts, divides them across the available GPUs, runs generation in parallel, and merges the results back together.
The implementation uses CUDA_VISIBLE_DEVICES to assign specific GPUs to each process, preventing memory conflicts. Each worker loads its own instance of the LLM with configurable memory utilization. After all workers complete, the main process merges output files from temporary split directories and optionally cleans up intermediate files. This approach enables efficient scaling of data generation tasks that would be impractical on a single GPU due to time constraints.
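A minimal sketch of the spawn-and-split pattern described above. To keep it runnable without a GPU, the LLM load/generate step is replaced with a placeholder; the names `run_parallel` and the string-formatting "generation" are illustrative assumptions, not code from multi.py:

```python
import json
import math
import multiprocessing
import os


def worker_function(device, prompts, out_path):
    """Hypothetical worker: pin this process to one GPU, then generate."""
    # Restricting GPU visibility must happen before any CUDA context is
    # created, which is why each split runs in a fresh process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
    # Placeholder for loading the LLM and generating on this worker's split.
    results = [f"[gpu {device}] {p}" for p in prompts]
    with open(out_path, "w") as f:
        json.dump(results, f)


def run_parallel(prompts, num_gpus, out_dir):
    """Split prompts into contiguous chunks and run one worker per GPU."""
    os.makedirs(out_dir, exist_ok=True)
    chunk = math.ceil(len(prompts) / num_gpus)
    procs = []
    for device in range(num_gpus):
        split = prompts[device * chunk:(device + 1) * chunk]
        out_path = os.path.join(out_dir, f"tmp_split_{device}.json")
        p = multiprocessing.Process(
            target=worker_function, args=(device, split, out_path))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    return [os.path.join(out_dir, f"tmp_split_{d}.json")
            for d in range(num_gpus)]
```

Contiguous chunking (rather than round-robin) keeps the later merge trivial: concatenating split files in index order restores the original prompt order.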
Usage
Use this script to parallelize LLM-based data generation across multiple GPUs when processing large datasets for the Reinforced IR pipeline, particularly for query generation or augmentation at scale.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/Reinforced_IR/inference/multi.py
- Lines: 1-167
Signature
def worker_function(device):
"""Worker function that runs on each GPU"""
def merge(args: Args):
"""Merge results from all workers"""
if __name__ == "__main__":
"""Main entry point for multi-GPU processing"""
Import
import os
import json
import shutil
import multiprocessing
from dataclasses import dataclass, field
from transformers import HfArgumentParser
from agent import LLMInstructAgent, LLMAgent
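The two agent imports correspond to the `model_type` switch in the I/O contract ("llm" vs. "llm_instruct"). A hedged sketch of that dispatch, with stub classes standing in for the real agents from the repo's `agent` module (whose constructor signatures are not shown here):

```python
class LLMAgent:
    """Stub for the repo's plain-completion agent (illustrative only)."""
    def __init__(self, model_path):
        self.model_path = model_path


class LLMInstructAgent(LLMAgent):
    """Stub for the repo's chat/instruction-tuned agent (illustrative only)."""


def build_agent(model_type, model_path):
    # "llm_instruct" is the documented default; "llm" selects the plain
    # completion-style agent. Any other value is rejected early.
    if model_type == "llm_instruct":
        return LLMInstructAgent(model_path)
    if model_type == "llm":
        return LLMAgent(model_path)
    raise ValueError(f"unknown model_type: {model_type}")
```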
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generate_model_path | str | Yes | Path to LLM for generation |
| input_dir | str | Yes | Directory containing JSON files with prompts |
| output_dir | str | Yes | Directory to save generated outputs |
| num_gpus | int | Yes | Number of GPUs to use for parallel processing |
| temperature | float | No | LLM generation temperature (default: 0.8) |
| gpu_memory_utilization | float | No | GPU memory fraction per worker (default: 0.8) |
| max_tokens | int | No | Max tokens per generation (default: 300) |
| model_type | str | No | LLM type: "llm" or "llm_instruct" (default: "llm_instruct") |
| rm_tmp | bool | No | Remove temporary split directories (default: True) |
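The parameters above map naturally onto a dataclass parsed by `HfArgumentParser` (as the imports suggest). This sketch shows only the container, with field names and defaults taken from the table; the parser itself is omitted so the snippet stays self-contained:

```python
from dataclasses import dataclass


@dataclass
class Args:
    """Argument container implied by the I/O contract. In the real script
    this would be fed to transformers.HfArgumentParser."""
    generate_model_path: str          # path to the generation LLM
    input_dir: str                    # directory of JSON prompt files
    output_dir: str                   # directory for merged outputs
    num_gpus: int                     # number of parallel workers
    temperature: float = 0.8          # sampling temperature
    gpu_memory_utilization: float = 0.8  # memory fraction per worker
    max_tokens: int = 300             # generation length cap
    model_type: str = "llm_instruct"  # "llm" or "llm_instruct"
    rm_tmp: bool = True               # delete tmp_split_N dirs after merge
```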
Outputs
| Name | Type | Description |
|---|---|---|
| output JSON files | JSON | Generated text for each input file, merged across GPUs |
| tmp_split_N directories | Directory | Temporary worker outputs (removed if rm_tmp=True) |
Usage Examples
# Command line usage
python multi.py \
--generate_model_path Meta-Llama-3-8B-Instruct \
--model_type llm_instruct \
--input_dir ./prompts \
--output_dir ./outputs \
--num_gpus 8 \
--temperature 0.7 \
--gpu_memory_utilization 0.9 \
--max_tokens 300 \
--rm_tmp True
# Input directory structure:
# ./prompts/
# dataset1.json # List of prompt strings
# dataset2.json
# Output directory structure (during processing):
# ./outputs/
# tmp_split_0/
# dataset1.json # Partial results from GPU 0
# tmp_split_1/
# dataset1.json # Partial results from GPU 1
# ...
# Final output (after merge):
# ./outputs/
# dataset1.json # Merged results from all GPUs
# dataset2.json
# Input JSON format:
[
"Prompt 1 text",
"Prompt 2 text",
"Prompt 3 text"
]
# Output JSON format:
[
"Generated response 1",
"Generated response 2",
"Generated response 3"
]
# The script automatically:
# 1. Splits data across num_gpus workers
# 2. Each worker processes its split independently
# 3. Results are merged in original order
# 4. Temporary files are cleaned up
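Steps 3 and 4 above can be sketched as follows. `merge_splits` is a hypothetical name; the actual merge in multi.py may differ in details, but the shape is implied by the directory layout shown earlier: each `tmp_split_N` directory holds the same file names, and concatenating them in split order restores the original prompt order.

```python
import json
import os
import shutil


def merge_splits(output_dir, num_gpus, rm_tmp=True):
    """Concatenate per-GPU partial results file by file, then clean up."""
    split_dirs = [os.path.join(output_dir, f"tmp_split_{i}")
                  for i in range(num_gpus)]
    # Every split directory holds the same file names, one per input dataset.
    for name in sorted(os.listdir(split_dirs[0])):
        merged = []
        for d in split_dirs:
            with open(os.path.join(d, name)) as f:
                merged.extend(json.load(f))  # split order == original order
        with open(os.path.join(output_dir, name), "w") as f:
            json.dump(merged, f)
    if rm_tmp:
        # Mirrors the rm_tmp flag: drop the intermediate worker outputs.
        for d in split_dirs:
            shutil.rmtree(d)
```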