Principle:OpenBMB UltraFeedback Result Collection
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A completion aggregation and persistence strategy that collects model-generated responses and stores them in a structured JSON format for downstream annotation.
Description
Result Collection is the final step of the completion generation phase. After each model generates a response to an instruction, the result is appended to the instruction's completions array as a structured dictionary containing the model identifier, the principle category, the system prompt used, and the generated text.
The pipeline writes results back to the same JSON file it read from (in-place update), allowing incremental accumulation of completions across multiple generation passes with different models. Each completion entry preserves full provenance: which model generated it, which principle guided it, and the exact system prompt used.
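The read, append, and write-back cycle described above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the function name `collect_results`, the `generate` callable, and the `"instruction"` key holding the prompt text are assumptions made for the example. Only the four completion fields come from the description.

```python
import json
from pathlib import Path


def collect_results(path, model_type, principle_category,
                    principle_prompt_text, generate):
    """Append one completion per instruction, then rewrite the file in place.

    `generate` is a hypothetical callable (instruction_text, system_prompt) -> str
    standing in for the real model call.
    """
    path = Path(path)
    dataset = json.loads(path.read_text(encoding="utf-8"))

    for instruction in dataset:
        # Assumption: the prompt text lives under the "instruction" key.
        response = generate(instruction["instruction"], principle_prompt_text)
        # Assumption: create the list if no earlier pass has done so yet.
        instruction.setdefault("completions", []).append({
            "model": model_type,
            "principle": principle_category,
            "custom_system_prompt": principle_prompt_text,
            "response": response,
        })

    # In-place update: overwrite the same file that was just read.
    path.write_text(json.dumps(dataset, indent=4), encoding="utf-8")
    return dataset
```

Calling this function once per model against the same file leaves every instruction with one completion entry per pass, in the order the passes were run.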
Usage
Use this principle when building data generation pipelines that accumulate completions from multiple sources. The in-place JSON update pattern allows running the pipeline separately for each model while building up a complete dataset.
Theoretical Basis
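The storage schema follows a nested document design: each instruction is a top-level record, and its completions are stored in a nested array. This fits the data better than a flat table because the number of completions per instruction varies. A flat layout would have to either repeat the instruction on every completion row or reserve a fixed number of completion columns.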
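To make the comparison concrete, the sketch below builds a small nested dataset and flattens it into one row per completion. The sample values, the `"instruction"` key, and the `flatten` helper are made up for illustration; only the four completion fields are taken from the schema described here.

```python
# Nested design: one record per instruction, any number of completions.
dataset = [
    {
        "instruction": "Explain recursion.",
        "completions": [
            {"model": "model-a", "principle": "helpfulness",
             "custom_system_prompt": "Be helpful.", "response": "first answer"},
            {"model": "model-b", "principle": "honesty",
             "custom_system_prompt": "Be honest.", "response": "second answer"},
        ],
    },
    {
        "instruction": "Name a prime number.",
        "completions": [
            {"model": "model-a", "principle": "helpfulness",
             "custom_system_prompt": "Be helpful.", "response": "third answer"},
        ],
    },
]


def flatten(records):
    """Return one flat row per (instruction, completion) pair."""
    rows = []
    for record in records:
        for completion in record["completions"]:
            # The instruction text is copied onto every row.
            rows.append({"instruction": record["instruction"], **completion})
    return rows


rows = flatten(dataset)
```

Here the two nested records become three flat rows, and the first instruction's text appears twice. The nested form stores it once, whatever the completion count.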
Pseudo-code Logic:
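import json

# Abstract algorithm. `results` and `dataset_path` are placeholder names:
# `results` yields one tuple per generated completion, and `dataset_path`
# is the JSON file that `dataset` was loaded from.
for instruction, model_type, principle_category, principle_prompt_text, generated_text in results:
    instruction["completions"].append({
        "model": model_type,
        "principle": principle_category,
        "custom_system_prompt": principle_prompt_text,
        "response": generated_text,
    })

# Persist to disk by overwriting the source file (in-place update)
with open(dataset_path, "w") as file:
    json.dump(dataset, file, indent=4)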