
Implementation:PacktPublishing LLM Engineers Handbook Dataset Push To Hub

From Leeroopedia


Overview

Dataset Push To Hub implements Principle:PacktPublishing_LLM_Engineers_Handbook_Evaluation_Results_Aggregation by loading evaluated datasets from the HuggingFace Hub, computing mean accuracy and style scores per model, and printing a summary comparison. Results are persisted on the Hub for downstream consumption.

Aspect              | Detail
Implementation Name | Dataset Push To Hub
Workflow            | Model_Evaluation
Type                | Wrapper Doc (HuggingFace datasets)
Source File         | llm_engineering/model/evaluation/evaluate.py (Lines 208–228)
Implements          | Principle:PacktPublishing_LLM_Engineers_Handbook_Evaluation_Results_Aggregation

API Signatures

# Loading evaluated results
load_dataset(repo_id, split="all") -> Dataset

# Publishing results
Dataset.push_to_hub(repo_id) -> None

Key Code

# For each model in the evaluation set:
dataset = load_dataset(
    f"{workspace}/{model_name}-results",
    split="all"
)

avg_accuracy = sum(dataset["accuracy"]) / len(dataset["accuracy"])
avg_style = sum(dataset["style"]) / len(dataset["style"])

print(f"Model: {model_name}")
print(f"  Accuracy: {avg_accuracy:.2f}")
print(f"  Style:    {avg_style:.2f}")

Imports

from datasets import load_dataset

Inputs

Input            | Type                     | Description
Results datasets | HuggingFace Hub datasets | Datasets containing per-sample accuracy, style, and evaluation columns, published by the LLM-as-Judge scoring step
workspace        | str                      | HuggingFace Hub namespace (e.g., "pauliusztin") derived from MODEL_HUGGINGFACE_WORKSPACE
model_name       | str                      | Name of the model whose results are being aggregated (e.g., "llm-twin-7b")

Outputs

Output            | Type                    | Description
Console summary   | Printed text            | Per-model aggregated scores: mean accuracy and mean style, formatted to two decimal places
Persisted results | HuggingFace Hub dataset | The evaluated dataset (with all per-sample scores) remains on the Hub for downstream access

Step-by-Step Behavior

  1. Iterate over models: For each model in the evaluation configuration (typically both a fine-tuned model and a baseline), the following steps are performed
  2. Load results dataset: The results dataset (containing generated answers and judge scores) is loaded from HuggingFace Hub using load_dataset() with split="all"
  3. Compute mean accuracy: The "accuracy" column values are summed and divided by the number of samples
  4. Compute mean style: The "style" column values are summed and divided by the number of samples
  5. Print summary: The model name and aggregated scores are printed to the console in a human-readable format
  6. Results persist on Hub: The per-sample results dataset (pushed during the scoring step) remains available on HuggingFace Hub for further analysis
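Putting the steps together, the aggregation loop can be sketched as below. The `summarize` helper is a hypothetical refactoring (not in the source) that isolates the mean computation so it can be exercised without Hub access; the repo naming pattern follows the key code above.

```python
def summarize(accuracy: list[float], style: list[float]) -> tuple[float, float]:
    """Mean accuracy and style over per-sample judge scores."""
    return sum(accuracy) / len(accuracy), sum(style) / len(style)


def report(workspace: str, model_names: list[str]) -> None:
    """Load each model's results split from the Hub and print mean scores."""
    # Deferred import so the pure aggregation logic above works without
    # the `datasets` package or network access.
    from datasets import load_dataset

    for model_name in model_names:
        # Repo naming pattern from the key code: "{workspace}/{model_name}-results".
        dataset = load_dataset(f"{workspace}/{model_name}-results", split="all")
        avg_accuracy, avg_style = summarize(dataset["accuracy"], dataset["style"])
        print(f"Model: {model_name}")
        print(f"  Accuracy: {avg_accuracy:.2f}")
        print(f"  Style:    {avg_style:.2f}")
```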

Example Output

Model: llm-twin-7b
  Accuracy: 2.45
  Style:    2.31

Model: TwinLlama-3.1-8B
  Accuracy: 2.12
  Style:    2.08

This output enables quick comparison: the fine-tuned llm-twin-7b model outperforms the baseline TwinLlama-3.1-8B on both accuracy and style.
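The comparison can also be done programmatically. The helper below is illustrative and not part of the source; it ranks models by the sum of their two aggregated means, using the example figures above.

```python
def compare(results: dict[str, tuple[float, float]]) -> str:
    """Return the model name with the highest combined (accuracy + style) mean.

    `results` maps model name -> (mean_accuracy, mean_style). This ranking
    helper is a hypothetical addition, not part of the source code.
    """
    return max(results, key=lambda name: sum(results[name]))


# Using the example output above:
scores = {"llm-twin-7b": (2.45, 2.31), "TwinLlama-3.1-8B": (2.12, 2.08)}
best = compare(scores)  # "llm-twin-7b"
```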

Data Flow

The aggregation step consumes data produced by the full upstream pipeline:

Column      | Source           | Description
instruction | Original dataset | The prompt given to the model
answers     | Batch Inference  | The model's generated response
accuracy    | LLM-as-Judge     | Accuracy score (1–3)
style       | LLM-as-Judge     | Style score (1–3)
evaluation  | LLM-as-Judge     | Free-text explanation of scores
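Before aggregating, it can help to verify that a loaded results split actually carries the columns above. This validator is a hypothetical addition, not in the source; a `datasets.Dataset` exposes its schema via `dataset.column_names`.

```python
# Expected schema of a results dataset, per the data-flow table above.
REQUIRED_COLUMNS = {"instruction", "answers", "accuracy", "style", "evaluation"}


def missing_columns(column_names: list[str]) -> set[str]:
    """Return the expected columns absent from a results dataset's schema."""
    return REQUIRED_COLUMNS - set(column_names)


# missing_columns(dataset.column_names) == set() means the split is complete.
```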

External Dependencies

Dependency             | Purpose
datasets (HuggingFace) | Loading datasets from the Hub via load_dataset() and publishing via push_to_hub()

Design Notes

  • Simple aggregation: Mean computation is deliberately simple. For a 1–3 scale with a modest number of samples, more sophisticated statistics (median, percentiles) add complexity without significantly improving decision quality.
  • Console output: Results are printed to stdout, making them visible in SageMaker Processing job logs as well as local terminal output. This dual-use approach keeps the aggregation logic environment-agnostic.
  • Hub persistence: The per-sample results remain on HuggingFace Hub even after aggregation. This allows anyone to recompute aggregates, perform deeper analysis, or debug individual low-scoring samples.
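Because the per-sample rows stay on the Hub, richer statistics than the mean can be recomputed at any time. A minimal sketch using only the standard library, applied to one score column (the helper and its name are illustrative, not from the source):

```python
import statistics


def score_summary(scores: list[float]) -> dict[str, float]:
    """Mean, median, and spread for one score column (e.g. "accuracy")."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # stdev needs at least two samples; report 0.0 otherwise.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```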
