Principle:Googleapis Python genai Training Dataset Preparation

From Leeroopedia
Knowledge Sources
Domains Fine_Tuning, Data_Preparation
Last Updated 2026-02-15 00:00 GMT

Overview

A structured specification of training data for supervised fine-tuning of language models, supporting both file-based and inline example formats.

Description

Training Dataset Preparation defines how training data is provided to a fine-tuning job. Data can be supplied as a reference to a JSONL file in Google Cloud Storage, a Vertex AI Multimodal Dataset resource, or inline as a list of input/output text examples. The JSONL format contains examples with input-output pairs that teach the model to produce specific outputs for given inputs. Proper dataset preparation is critical for fine-tuning quality, as the model learns to mimic the patterns in the training examples.

Usage

Use a GCS URI for production fine-tuning with large datasets stored in Cloud Storage. Use inline examples for small-scale experimentation or when the data is generated dynamically. The dataset format must match the base model's expected schema (typically {"text_input": "...", "output": "..."} for text models).
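Whichever delivery route is used, it can help to check each example against the expected schema before submitting a job. The following is a minimal sketch, assuming the two-key text schema shown above; the helper name validate_example and its error messages are hypothetical and not part of any SDK.

```python
# Assumed schema for text models, taken from the example on this page.
REQUIRED_KEYS = ("text_input", "output")


def validate_example(record):
    """Return a list of problems found in one training example (empty if valid)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str):
            problems.append(f"'{key}' is missing or not a string")
        elif not value.strip():
            problems.append(f"'{key}' must not be empty")
    return problems


good = {"text_input": "Classify: I love this product", "output": "positive"}
bad = {"text_input": "Classify: Terrible experience"}  # no "output" key

print(validate_example(good))  # []
print(validate_example(bad))   # ["'output' is missing or not a string"]
```

Returning a list of problems rather than raising on the first one makes it easier to report every defect in a large dataset in a single pass.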

Theoretical Basis

Supervised fine-tuning optimizes the model on labeled examples:

$$L(\theta) = \sum_{i=1}^{N} \log P_\theta(y_i \mid x_i)$$

Where x_i are the inputs, y_i are the target outputs, and N is the number of examples in the training dataset. The model parameters θ are updated to maximize this log-likelihood, that is, the probability of generating the correct output for each input.
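As a toy illustration of this objective, the sketch below sums log-probabilities over a three-example dataset. The probability values are invented for illustration only; in practice they would come from the model's predicted distribution over outputs.

```python
import math


def log_likelihood(target_probs):
    """Sum of log P_theta(y_i | x_i) over all examples in the dataset."""
    return sum(math.log(p) for p in target_probs)


# Hypothetical probabilities the model assigns to the correct output y_i
# for each input x_i, before and after fine-tuning.
before_tuning = [0.20, 0.10, 0.25]
after_tuning = [0.90, 0.80, 0.95]

ll_before = log_likelihood(before_tuning)
ll_after = log_likelihood(after_tuning)

# Fine-tuning aims to raise the objective toward its maximum of 0.
print(ll_after > ll_before)  # True
```

Because every probability is at most 1, each term is at most 0, so the objective reaches its upper bound of 0 only when the model assigns probability 1 to every target output.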

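Dataset format (JSONL). Each line is a standalone JSON object holding one input/output pair. JSONL has no comment syntax, so the file must contain only the records themselves:
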
{"text_input": "Classify: I love this product", "output": "positive"}
{"text_input": "Classify: Terrible experience", "output": "negative"}
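A file in this format can be produced with the standard json module. The sketch below writes the two records above to a local file and reads them back; the file name train.jsonl and the helper names are hypothetical, and uploading the file to Cloud Storage is not shown.

```python
import json
import os
import tempfile

examples = [
    {"text_input": "Classify: I love this product", "output": "positive"},
    {"text_input": "Classify: Terrible experience", "output": "negative"},
]


def write_jsonl(records, path):
    """Write one JSON object per line, with no comments or trailing commas."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical local file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(examples, path)
round_trip = read_jsonl(path)
print(round_trip == examples)  # True
```

Serializing with json.dumps rather than string formatting ensures that quotes and newlines inside an example are escaped, so each record stays on a single line.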

Related Pages

Implemented By
