Overview
Dataset Push To Hub implements the Principle:PacktPublishing_LLM_Engineers_Handbook_Evaluation_Results_Aggregation principle by loading evaluated datasets from HuggingFace Hub, computing mean accuracy and style scores per model, and printing a summary comparison. Results are persisted on the Hub for downstream consumption.
API Signatures
```python
# Loading evaluated results
dataset = load_dataset(repo_id, split="all")  # -> Dataset

# Publishing results
dataset.push_to_hub(repo_id)  # -> None
```
Key Code
# For each model in the evaluation set:
```python
# For each model in the evaluation set:
dataset = load_dataset(
    f"{workspace}/{model_name}-results",
    split="all",
)
avg_accuracy = sum(dataset["accuracy"]) / len(dataset["accuracy"])
avg_style = sum(dataset["style"]) / len(dataset["style"])

print(f"Model: {model_name}")
print(f"  Accuracy: {avg_accuracy:.2f}")
print(f"  Style: {avg_style:.2f}")
```
Imports
```python
from datasets import load_dataset
```
Inputs
| Input | Type | Description |
|---|---|---|
| Results datasets | HuggingFace Hub datasets | Datasets containing per-sample `accuracy`, `style`, and `evaluation` columns, published by the LLM-as-Judge scoring step |
| `workspace` | `str` | HuggingFace Hub namespace (e.g., `"pauliusztin"`) derived from `MODEL_HUGGINGFACE_WORKSPACE` |
| `model_name` | `str` | Name of the model whose results are being aggregated (e.g., `"llm-twin-7b"`) |
Outputs
| Output | Type | Description |
|---|---|---|
| Console summary | Printed text | Per-model aggregated scores: mean accuracy and mean style, formatted to two decimal places |
| Persisted results | HuggingFace Hub dataset | The evaluated dataset (with all per-sample scores) remains on the Hub for downstream access |
Step-by-Step Behavior
- Iterate over models: For each model in the evaluation configuration (typically both a fine-tuned model and a baseline), the following steps are performed.
- Load results dataset: The results dataset (containing generated answers and judge scores) is loaded from HuggingFace Hub using `load_dataset()` with `split="all"`.
- Compute mean accuracy: The `"accuracy"` column values are summed and divided by the number of samples.
- Compute mean style: The `"style"` column values are summed and divided by the number of samples.
- Print summary: The model name and aggregated scores are printed to the console in a human-readable format.
- Results persist on Hub: The per-sample results dataset (pushed during the scoring step) remains available on HuggingFace Hub for further analysis.
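The steps above can be sketched end to end. In this sketch the Hub call is replaced by an in-memory stand-in so the aggregation logic runs on its own; the model names mirror the example output, but the per-sample scores are illustrative:

```python
# Stand-in for datasets loaded from the Hub; in the real pipeline each entry
# would come from load_dataset(f"{workspace}/{model_name}-results", split="all").
results = {
    "llm-twin-7b": {"accuracy": [3, 2, 2, 3], "style": [2, 3, 2, 2]},
    "TwinLlama-3.1-8B": {"accuracy": [2, 2, 3, 1], "style": [2, 2, 2, 2]},
}

def aggregate(dataset: dict) -> tuple[float, float]:
    """Mean accuracy and mean style over all evaluated samples."""
    avg_accuracy = sum(dataset["accuracy"]) / len(dataset["accuracy"])
    avg_style = sum(dataset["style"]) / len(dataset["style"])
    return avg_accuracy, avg_style

for model_name, dataset in results.items():
    avg_accuracy, avg_style = aggregate(dataset)
    print(f"Model: {model_name}")
    print(f"  Accuracy: {avg_accuracy:.2f}")
    print(f"  Style: {avg_style:.2f}")
```

Swapping the `results` dict for real `load_dataset()` calls recovers the pipeline's behavior unchanged.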
Example Output
```
Model: llm-twin-7b
  Accuracy: 2.45
  Style: 2.31
Model: TwinLlama-3.1-8B
  Accuracy: 2.12
  Style: 2.08
```
This output enables quick comparison: the fine-tuned llm-twin-7b model outperforms the baseline TwinLlama-3.1-8B on both accuracy and style.
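The same comparison can be made programmatically rather than by eyeballing the console output. This sketch assumes the per-model means have already been computed (the numbers mirror the example output above):

```python
# Aggregated means taken from the example output.
scores = {
    "llm-twin-7b": {"accuracy": 2.45, "style": 2.31},
    "TwinLlama-3.1-8B": {"accuracy": 2.12, "style": 2.08},
}

# Rank models by mean accuracy, breaking ties on mean style.
ranked = sorted(
    scores,
    key=lambda m: (scores[m]["accuracy"], scores[m]["style"]),
    reverse=True,
)
print("Best model:", ranked[0])  # Best model: llm-twin-7b
```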
Data Flow
The aggregation step consumes data produced by the full upstream pipeline:
| Column | Source | Description |
|---|---|---|
| `instruction` | Original dataset | The prompt given to the model |
| `answers` | Batch Inference | The model's generated response |
| `accuracy` | LLM-as-Judge | Accuracy score (1–3) |
| `style` | LLM-as-Judge | Style score (1–3) |
| `evaluation` | LLM-as-Judge | Free-text explanation of scores |
External Dependencies
| Dependency | Purpose |
|---|---|
| `datasets` (HuggingFace) | Loading datasets from the Hub via `load_dataset()` and publishing via `push_to_hub()` |
Design Notes
- Simple aggregation: Mean computation is deliberately simple. For a 1–3 scale with a modest number of samples, more sophisticated statistics (median, percentiles) add complexity without significantly improving decision quality.
- Console output: Results are printed to stdout, making them visible in SageMaker Processing job logs as well as local terminal output. This dual-use approach keeps the aggregation logic environment-agnostic.
- Hub persistence: The per-sample results remain on HuggingFace Hub even after aggregation. This allows anyone to recompute aggregates, perform deeper analysis, or debug individual low-scoring samples.
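Because the per-sample rows stay on the Hub, low-scoring samples can be pulled back for manual inspection. A minimal sketch of that debugging workflow, using a hypothetical in-memory row list in place of the loaded dataset (the instructions and judge comments are invented for illustration):

```python
# Hypothetical per-sample rows; in practice these come from
# load_dataset(f"{workspace}/{model_name}-results", split="all").
rows = [
    {"instruction": "Explain RAG.", "accuracy": 3, "style": 2,
     "evaluation": "Accurate, slightly dry."},
    {"instruction": "Summarize the post.", "accuracy": 1, "style": 2,
     "evaluation": "Misses key points."},
]

# Keep only the samples the judge scored lowest on accuracy.
low_scoring = [r for r in rows if r["accuracy"] == 1]
for r in low_scoring:
    print(r["instruction"], "->", r["evaluation"])
```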