Implementation: FlagOpen FlagEmbedding Reinforced IR Multi-GPU
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, LLM Inference, Data Generation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Multi-GPU parallel text generation system for scaling LLM-based data generation in the Reinforced IR pipeline.
Description
This module implements a multi-GPU parallel processing system for generating synthetic data at scale with large language models. It uses Python multiprocessing to spawn one process per GPU, each handling its own split of the input data independently. The system reads JSON files containing prompts, divides them across the available GPUs, runs generation in parallel, and merges the results back together.
The implementation uses CUDA_VISIBLE_DEVICES to assign specific GPUs to each process, preventing memory conflicts. Each worker loads its own instance of the LLM with configurable memory utilization. After all workers complete, the main process merges output files from temporary split directories and optionally cleans up intermediate files. This approach enables efficient scaling of data generation tasks that would be impractical on a single GPU due to time constraints.
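A minimal sketch of the spawn-and-split pattern described above. To keep it runnable without a GPU, the LLM load/generate step is replaced with a placeholder; the names `run_parallel` and the string-formatting "generation" are illustrative assumptions, not code from multi.py:

```python
import json
import math
import multiprocessing
import os


def worker_function(device, prompts, out_path):
    """Hypothetical worker: pin this process to one GPU, then generate."""
    # Restricting GPU visibility must happen before any CUDA context is
    # created, which is why each split runs in a fresh process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
    # Placeholder for loading the LLM and generating on this worker's split.
    results = [f"[gpu {device}] {p}" for p in prompts]
    with open(out_path, "w") as f:
        json.dump(results, f)


def run_parallel(prompts, num_gpus, out_dir):
    """Split prompts into contiguous chunks and run one worker per GPU."""
    os.makedirs(out_dir, exist_ok=True)
    chunk = math.ceil(len(prompts) / num_gpus)
    procs = []
    for device in range(num_gpus):
        split = prompts[device * chunk:(device + 1) * chunk]
        out_path = os.path.join(out_dir, f"tmp_split_{device}.json")
        p = multiprocessing.Process(
            target=worker_function, args=(device, split, out_path))
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
    return [os.path.join(out_dir, f"tmp_split_{d}.json")
            for d in range(num_gpus)]
```

Contiguous chunking (rather than round-robin) keeps the later merge trivial: concatenating split files in index order restores the original prompt order.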
Usage
Use this script to parallelize LLM-based data generation across multiple GPUs when processing large datasets for the Reinforced IR pipeline, particularly for query generation or augmentation at scale.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/Reinforced_IR/inference/multi.py
- Lines: 1-167
Signature
def worker_function(device):
"""Worker function that runs on each GPU"""
def merge(args: Args):
"""Merge results from all workers"""
if __name__ == "__main__":
"""Main entry point for multi-GPU processing"""
Import
import os
import json
import shutil
import multiprocessing
from dataclasses import dataclass, field
from transformers import HfArgumentParser
from agent import LLMInstructAgent, LLMAgent
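The two agent imports correspond to the `model_type` switch in the I/O contract ("llm" vs. "llm_instruct"). A hedged sketch of that dispatch, with stub classes standing in for the real agents from the repo's `agent` module (whose constructor signatures are not shown here):

```python
class LLMAgent:
    """Stub for the repo's plain-completion agent (illustrative only)."""
    def __init__(self, model_path):
        self.model_path = model_path


class LLMInstructAgent(LLMAgent):
    """Stub for the repo's chat/instruction-tuned agent (illustrative only)."""


def build_agent(model_type, model_path):
    # "llm_instruct" is the documented default; "llm" selects the plain
    # completion-style agent. Any other value is rejected early.
    if model_type == "llm_instruct":
        return LLMInstructAgent(model_path)
    if model_type == "llm":
        return LLMAgent(model_path)
    raise ValueError(f"unknown model_type: {model_type}")
```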
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| generate_model_path | str | Yes | Path to LLM for generation |
| input_dir | str | Yes | Directory containing JSON files with prompts |
| output_dir | str | Yes | Directory to save generated outputs |
| num_gpus | int | Yes | Number of GPUs to use for parallel processing |
| temperature | float | No | LLM generation temperature (default: 0.8) |
| gpu_memory_utilization | float | No | GPU memory fraction per worker (default: 0.8) |
| max_tokens | int | No | Max tokens per generation (default: 300) |
| model_type | str | No | LLM type: "llm" or "llm_instruct" (default: "llm_instruct") |
| rm_tmp | bool | No | Remove temporary split directories (default: True) |
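The parameters above map naturally onto a dataclass parsed by `HfArgumentParser` (as the imports suggest). This sketch shows only the container, with field names and defaults taken from the table; the parser itself is omitted so the snippet stays self-contained:

```python
from dataclasses import dataclass


@dataclass
class Args:
    """Argument container implied by the I/O contract. In the real script
    this would be fed to transformers.HfArgumentParser."""
    generate_model_path: str          # path to the generation LLM
    input_dir: str                    # directory of JSON prompt files
    output_dir: str                   # directory for merged outputs
    num_gpus: int                     # number of parallel workers
    temperature: float = 0.8          # sampling temperature
    gpu_memory_utilization: float = 0.8  # memory fraction per worker
    max_tokens: int = 300             # generation length cap
    model_type: str = "llm_instruct"  # "llm" or "llm_instruct"
    rm_tmp: bool = True               # delete tmp_split_N dirs after merge
```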
Outputs
| Name | Type | Description |
|---|---|---|
| output JSON files | JSON | Generated text for each input file, merged across GPUs |
| tmp_split_N directories | Directory | Temporary worker outputs (removed if rm_tmp=True) |
Usage Examples
# Command line usage
python multi.py \
--generate_model_path Meta-Llama-3-8B-Instruct \
--model_type llm_instruct \
--input_dir ./prompts \
--output_dir ./outputs \
--num_gpus 8 \
--temperature 0.7 \
--gpu_memory_utilization 0.9 \
--max_tokens 300 \
--rm_tmp True
# Input directory structure:
# ./prompts/
# dataset1.json # List of prompt strings
# dataset2.json
# Output directory structure (during processing):
# ./outputs/
# tmp_split_0/
# dataset1.json # Partial results from GPU 0
# tmp_split_1/
# dataset1.json # Partial results from GPU 1
# ...
# Final output (after merge):
# ./outputs/
# dataset1.json # Merged results from all GPUs
# dataset2.json
# Input JSON format:
[
"Prompt 1 text",
"Prompt 2 text",
"Prompt 3 text"
]
# Output JSON format:
[
"Generated response 1",
"Generated response 2",
"Generated response 3"
]
# The script automatically:
# 1. Splits data across num_gpus workers
# 2. Each worker processes its split independently
# 3. Results are merged in original order
# 4. Temporary files are cleaned up
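Steps 3 and 4 above can be sketched as follows. `merge_splits` is a hypothetical name; the actual merge in multi.py may differ in details, but the shape is implied by the directory layout shown earlier: each `tmp_split_N` directory holds the same file names, and concatenating them in split order restores the original prompt order.

```python
import json
import os
import shutil


def merge_splits(output_dir, num_gpus, rm_tmp=True):
    """Concatenate per-GPU partial results file by file, then clean up."""
    split_dirs = [os.path.join(output_dir, f"tmp_split_{i}")
                  for i in range(num_gpus)]
    # Every split directory holds the same file names, one per input dataset.
    for name in sorted(os.listdir(split_dirs[0])):
        merged = []
        for d in split_dirs:
            with open(os.path.join(d, name)) as f:
                merged.extend(json.load(f))  # split order == original order
        with open(os.path.join(output_dir, name), "w") as f:
            json.dump(merged, f)
    if rm_tmp:
        # Mirrors the rm_tmp flag: drop the intermediate worker outputs.
        for d in split_dirs:
            shutil.rmtree(d)
```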