Overview
Dataset Push To Hub implements the Principle:PacktPublishing_LLM_Engineers_Handbook_Evaluation_Results_Aggregation principle by loading evaluated datasets from HuggingFace Hub, computing mean accuracy and style scores per model, and printing a summary comparison. Results are persisted on the Hub for downstream consumption.
API Signatures
```python
# Loading evaluated results
dataset = load_dataset(repo_id, split="all")  # -> Dataset

# Publishing results
dataset.push_to_hub(repo_id)  # -> None
```
Key Code
# For each model in the evaluation set:
```python
# For each model in the evaluation set:
dataset = load_dataset(
    f"{workspace}/{model_name}-results",
    split="all",
)
avg_accuracy = sum(dataset["accuracy"]) / len(dataset["accuracy"])
avg_style = sum(dataset["style"]) / len(dataset["style"])

print(f"Model: {model_name}")
print(f"  Accuracy: {avg_accuracy:.2f}")
print(f"  Style: {avg_style:.2f}")
```
Imports
```python
from datasets import load_dataset
```
Inputs
| Input | Type | Description |
|---|---|---|
| Results datasets | HuggingFace Hub datasets | Datasets containing per-sample `accuracy`, `style`, and `evaluation` columns, published by the LLM-as-Judge scoring step |
| `workspace` | `str` | HuggingFace Hub namespace (e.g., `"pauliusztin"`) derived from `MODEL_HUGGINGFACE_WORKSPACE` |
| `model_name` | `str` | Name of the model whose results are being aggregated (e.g., `"llm-twin-7b"`) |
Outputs
| Output | Type | Description |
|---|---|---|
| Console summary | Printed text | Per-model aggregated scores: mean accuracy and mean style, formatted to two decimal places |
| Persisted results | HuggingFace Hub dataset | The evaluated dataset (with all per-sample scores) remains on the Hub for downstream access |
Step-by-Step Behavior
- Iterate over models: For each model in the evaluation configuration (typically both a fine-tuned model and a baseline), the following steps are performed.
- Load results dataset: The results dataset (containing generated answers and judge scores) is loaded from HuggingFace Hub using `load_dataset()` with `split="all"`.
- Compute mean accuracy: The `"accuracy"` column values are summed and divided by the number of samples.
- Compute mean style: The `"style"` column values are summed and divided by the number of samples.
- Print summary: The model name and aggregated scores are printed to the console in a human-readable format.
- Results persist on Hub: The per-sample results dataset (pushed during the scoring step) remains available on HuggingFace Hub for further analysis.
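The steps above can be sketched end to end. In this sketch the Hub call is replaced by an in-memory stand-in so the aggregation logic runs on its own; the model names mirror the example output, but the per-sample scores are illustrative:

```python
# Stand-in for datasets loaded from the Hub; in the real pipeline each entry
# would come from load_dataset(f"{workspace}/{model_name}-results", split="all").
results = {
    "llm-twin-7b": {"accuracy": [3, 2, 2, 3], "style": [2, 3, 2, 2]},
    "TwinLlama-3.1-8B": {"accuracy": [2, 2, 3, 1], "style": [2, 2, 2, 2]},
}

def aggregate(dataset: dict) -> tuple[float, float]:
    """Mean accuracy and mean style over all evaluated samples."""
    avg_accuracy = sum(dataset["accuracy"]) / len(dataset["accuracy"])
    avg_style = sum(dataset["style"]) / len(dataset["style"])
    return avg_accuracy, avg_style

for model_name, dataset in results.items():
    avg_accuracy, avg_style = aggregate(dataset)
    print(f"Model: {model_name}")
    print(f"  Accuracy: {avg_accuracy:.2f}")
    print(f"  Style: {avg_style:.2f}")
```

Swapping the `results` dict for real `load_dataset()` calls recovers the pipeline's behavior unchanged.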
Example Output
```
Model: llm-twin-7b
  Accuracy: 2.45
  Style: 2.31
Model: TwinLlama-3.1-8B
  Accuracy: 2.12
  Style: 2.08
```
This output enables quick comparison: the fine-tuned llm-twin-7b model outperforms the baseline TwinLlama-3.1-8B on both accuracy and style.
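The same comparison can be made programmatically rather than by eyeballing the console output. This sketch assumes the per-model means have already been computed (the numbers mirror the example output above):

```python
# Aggregated means taken from the example output.
scores = {
    "llm-twin-7b": {"accuracy": 2.45, "style": 2.31},
    "TwinLlama-3.1-8B": {"accuracy": 2.12, "style": 2.08},
}

# Rank models by mean accuracy, breaking ties on mean style.
ranked = sorted(
    scores,
    key=lambda m: (scores[m]["accuracy"], scores[m]["style"]),
    reverse=True,
)
print("Best model:", ranked[0])  # Best model: llm-twin-7b
```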
Data Flow
The aggregation step consumes data produced by the full upstream pipeline:
| Column | Source | Description |
|---|---|---|
| `instruction` | Original dataset | The prompt given to the model |
| `answers` | Batch Inference | The model's generated response |
| `accuracy` | LLM-as-Judge | Accuracy score (1–3) |
| `style` | LLM-as-Judge | Style score (1–3) |
| `evaluation` | LLM-as-Judge | Free-text explanation of scores |
External Dependencies
| Dependency | Purpose |
|---|---|
| `datasets` (HuggingFace) | Loading datasets from the Hub via `load_dataset()` and publishing via `push_to_hub()` |
Design Notes
- Simple aggregation: Mean computation is deliberately simple. For a 1–3 scale with a modest number of samples, more sophisticated statistics (median, percentiles) add complexity without significantly improving decision quality.
- Console output: Results are printed to stdout, making them visible in SageMaker Processing job logs as well as local terminal output. This dual-use approach keeps the aggregation logic environment-agnostic.
- Hub persistence: The per-sample results remain on HuggingFace Hub even after aggregation. This allows anyone to recompute aggregates, perform deeper analysis, or debug individual low-scoring samples.
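Because the per-sample rows stay on the Hub, low-scoring samples can be pulled back for manual inspection. A minimal sketch of that debugging workflow, using a hypothetical in-memory row list in place of the loaded dataset (the instructions and judge comments are invented for illustration):

```python
# Hypothetical per-sample rows; in practice these come from
# load_dataset(f"{workspace}/{model_name}-results", split="all").
rows = [
    {"instruction": "Explain RAG.", "accuracy": 3, "style": 2,
     "evaluation": "Accurate, slightly dry."},
    {"instruction": "Summarize the post.", "accuracy": 1, "style": 2,
     "evaluation": "Misses key points."},
]

# Keep only the samples the judge scored lowest on accuracy.
low_scoring = [r for r in rows if r["accuracy"] == 1]
for r in low_scoring:
    print(r["instruction"], "->", r["evaluation"])
```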