Implementation:BerriAI Litellm Completion With Fine Tuned Model

Knowledge Sources	Domains	Last Updated
BerriAI/litellm	Model Inference, Fine-Tuned Model Deployment, API Integration	2026-02-15

Overview

Concrete tool for running inference on fine-tuned models through the standard unified completion interface provided by LiteLLM.

Description

LiteLLM's completion() function serves as the universal entry point for invoking any supported language model, including fine-tuned models. After a fine-tuning job succeeds and produces a model identifier (available via job.fine_tuned_model), that identifier is passed as the model parameter to completion(). The function automatically resolves the provider, handles authentication, transforms the request to the provider-specific format, and returns a normalized ModelResponse object. No separate API or special handling is required for fine-tuned models -- they are used through exactly the same interface as base models.

The completion() function supports the full range of OpenAI-compatible parameters, including temperature, top_p, max_tokens, streaming, function calling, tool use, logprobs, and response formatting. These parameters are passed the same way for fine-tuned models as for base models; whether a given feature is honored depends on the underlying base model and provider.
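Because the parameter set is shared, one way to keep base-model and fine-tuned calls in step is to build the keyword arguments once and vary only the model. The following is a minimal sketch under that assumption; build_completion_kwargs is a hypothetical helper, not part of LiteLLM, and the model names are placeholders.

```python
from typing import Any, Dict, List


def build_completion_kwargs(
    model: str, messages: List[Dict[str, str]], **options: Any
) -> Dict[str, Any]:
    """Assemble keyword arguments for completion(), skipping unset options.

    Hypothetical helper for illustration; not part of LiteLLM.
    """
    kwargs: Dict[str, Any] = {"model": model, "messages": messages}
    # Drop options left as None so that provider defaults apply.
    kwargs.update({k: v for k, v in options.items() if v is not None})
    return kwargs


shared = {"temperature": 0.2, "max_tokens": 200, "seed": None}
messages = [{"role": "user", "content": "Classify this support ticket."}]

# Same options, different model identifier (both identifiers are placeholders).
base_kwargs = build_completion_kwargs("gpt-4o-mini", messages, **shared)
tuned_kwargs = build_completion_kwargs(
    "ft:gpt-4o-mini-2024-07-18:my-org:specialist:xyz789", messages, **shared
)
```

Either dictionary can then be unpacked into the call, for example litellm.completion(**tuned_kwargs).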

Usage

Use completion() with a fine-tuned model when:

A fine-tuning job has completed and its fine_tuned_model field is populated.
You are integrating fine-tuned model inference into existing application code.
You are comparing fine-tuned model output against base model output.
You are deploying fine-tuned models in production through the LiteLLM proxy or direct API calls.
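The comparison case can be sketched as a small loop that sends the same messages to each model. This is an illustrative sketch, not a LiteLLM feature: compare_models and fake_complete are hypothetical names, and the completion function is injected so the example runs offline with a stand-in.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List


def compare_models(
    complete: Callable[..., Any],
    models: List[str],
    messages: List[Dict[str, str]],
    **options: Any,
) -> Dict[str, str]:
    """Send the same prompt to each model and return the first choice's text per model."""
    results: Dict[str, str] = {}
    for model in models:
        response = complete(model=model, messages=messages, **options)
        results[model] = response.choices[0].message.content
    return results


def fake_complete(model: str, messages: List[Dict[str, str]], **_: Any) -> Any:
    """Stand-in for litellm.completion so the sketch runs without network access."""
    text = f"[{model}] {messages[-1]['content']}"
    message = SimpleNamespace(role="assistant", content=text)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


outputs = compare_models(
    fake_complete,
    ["gpt-3.5-turbo", "ft:gpt-3.5-turbo:my-org:custom-suffix:abc123"],
    [{"role": "user", "content": "Summarize the contract terms."}],
    temperature=0.0,
)
```

In real use, litellm.completion would be passed in place of fake_complete; a low temperature keeps the two outputs easier to compare.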

Code Reference

Source Location

litellm/main.py (lines 999-7447)

Signature

def completion(
    model: str,
    messages: List = [],
    timeout: Optional[Union[float, str, httpx.Timeout]] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    n: Optional[int] = None,
    stream: Optional[bool] = None,
    stream_options: Optional[dict] = None,
    stop=None,
    max_completion_tokens: Optional[int] = None,
    max_tokens: Optional[int] = None,
    modalities: Optional[List[ChatCompletionModality]] = None,
    prediction: Optional[ChatCompletionPredictionContentParam] = None,
    audio: Optional[ChatCompletionAudioParam] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    logit_bias: Optional[dict] = None,
    user: Optional[str] = None,
    reasoning_effort: Optional[Literal[
        "none", "minimal", "low", "medium", "high", "xhigh", "default"
    ]] = None,
    response_format: Optional[Union[dict, Type[BaseModel]]] = None,
    seed: Optional[int] = None,
    tools: Optional[List] = None,
    tool_choice: Optional[Union[str, dict]] = None,
    logprobs: Optional[bool] = None,
    top_logprobs: Optional[int] = None,
    parallel_tool_calls: Optional[bool] = None,
    deployment_id=None,
    extra_headers: Optional[dict] = None,
    functions: Optional[List] = None,
    function_call: Optional[str] = None,
    base_url: Optional[str] = None,
    api_version: Optional[str] = None,
    api_key: Optional[str] = None,
    model_list: Optional[list] = None,
    **kwargs,
) -> Union[ModelResponse, CustomStreamWrapper]:

Import

import litellm
from litellm import completion

I/O Contract

Inputs

Parameter	Type	Required	Description
model	`str`	Yes	The model identifier. For fine-tuned models, use the identifier from the completed job's `fine_tuned_model` field (e.g., "ft:gpt-3.5-turbo:my-org:suffix:id"). For Azure, use "azure/deployment-name". For Vertex AI, use the tuned model resource path.
messages	`List`	Yes	A list of message dicts with "role" and "content" keys representing the conversation.
temperature	`Optional[float]`	No	Sampling temperature (0.0 to 2.0). Lower values produce more deterministic output.
top_p	`Optional[float]`	No	Nucleus sampling parameter.
max_tokens	`Optional[int]`	No	Maximum number of tokens in the completion.
max_completion_tokens	`Optional[int]`	No	Maximum number of completion tokens (newer API parameter).
stream	`Optional[bool]`	No	If True, returns a streaming response.
stop	various	No	Stop sequence(s) where the model should stop generating.
tools	`Optional[List]`	No	List of tool/function definitions for function calling.
tool_choice	`Optional[Union[str, dict]]`	No	Controls which tool the model calls.
seed	`Optional[int]`	No	Seed for reproducible sampling. Determinism is best-effort and depends on the provider.
api_key	`Optional[str]`	No	Override API key for this request.
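base_url	`Optional[str]`	No	Override API base URL for this request. The Azure example on this page passes the same setting as `api_base`, which is accepted through `**kwargs`.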
**kwargs	various	No	Additional provider-specific parameters.
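For logging or routing it can help to split an OpenAI-style fine-tuned identifier into its parts. The sketch below assumes the five-field shape of the example in the table ("ft:", base model, organization, suffix, id); parse_openai_fine_tuned_id and the field names are hypothetical, and identifiers with any other shape are rejected rather than guessed at.

```python
from typing import NamedTuple


class FineTunedModelId(NamedTuple):
    """Fields of an identifier shaped like ft:<base>:<org>:<suffix>:<id> (assumed layout)."""

    base_model: str
    organization: str
    suffix: str
    job_id: str


def parse_openai_fine_tuned_id(model: str) -> FineTunedModelId:
    """Split an OpenAI-style fine-tuned model identifier into its fields.

    Hypothetical helper; raises ValueError for any other identifier shape.
    """
    parts = model.split(":")
    if len(parts) != 5 or parts[0] != "ft":
        raise ValueError(f"Not a five-field 'ft:' identifier: {model!r}")
    return FineTunedModelId(*parts[1:])


parsed = parse_openai_fine_tuned_id("ft:gpt-3.5-turbo:my-org:custom-suffix:abc123")
```

The full identifier, not the parsed base model, is still what gets passed as the model parameter.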

Outputs

Return Type	Description
`ModelResponse`	A normalized response object containing: id, choices (list of completion choices with message containing role and content), created, model (the model identifier used), usage (prompt_tokens, completion_tokens, total_tokens). Identical structure regardless of whether the model is a base model or fine-tuned model.
`CustomStreamWrapper`	Returned when `stream=True`. An iterable yielding streamed completion chunks.
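The fields listed for ModelResponse can be flattened into a plain dictionary for logging. This is a minimal sketch that relies only on the attributes named in the table above; summarize_response is a hypothetical helper, and the sample object is a stand-in built with SimpleNamespace rather than a real response.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_response(response: Any) -> Dict[str, Any]:
    """Copy the commonly logged fields of a non-streaming response into a dict."""
    usage = response.usage
    return {
        "id": response.id,
        "model": response.model,
        "content": response.choices[0].message.content,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }


# Stand-in object with the same attribute layout as the table describes.
sample = SimpleNamespace(
    id="chatcmpl-example",
    model="ft:gpt-3.5-turbo:my-org:custom-suffix:abc123",
    choices=[
        SimpleNamespace(message=SimpleNamespace(role="assistant", content="Done."))
    ],
    usage=SimpleNamespace(prompt_tokens=12, completion_tokens=3, total_tokens=15),
)
summary = summarize_response(sample)
```

Because the structure is the same for base and fine-tuned models, the same helper would apply to either.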

Usage Examples

Basic completion with an OpenAI fine-tuned model

import litellm

# Use the fine-tuned model identifier from a completed job
response = litellm.completion(
    model="ft:gpt-3.5-turbo:my-org:custom-suffix:abc123",
    messages=[
        {"role": "system", "content": "You are a domain expert assistant."},
        {"role": "user", "content": "Explain the key findings from the Q3 report."},
    ],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Streaming completion with a fine-tuned model

import litellm

response = litellm.completion(
    model="ft:gpt-4o-mini-2024-07-18:my-org:specialist:xyz789",
    messages=[
        {"role": "user", "content": "Summarize the contract terms."},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
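When the full text is needed after streaming, the chunks can be accumulated instead of printed. The sketch below assumes each chunk exposes choices[0].delta.content as in the loop above; collect_stream and the fake chunks are hypothetical and stand in for a real streamed response.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def collect_stream(chunks: Iterable[Any]) -> str:
    """Join the text deltas of a streamed completion into one string."""
    parts: List[str] = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:  # the delta content may be None or empty
            parts.append(text)
    return "".join(parts)


def _chunk(text: Any) -> Any:
    """Build a stand-in chunk with the attribute layout used above."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [_chunk("Term "), _chunk(None), _chunk("sheet"), SimpleNamespace(choices=[])]
full_text = collect_stream(fake_stream)
```

With a real call, the object returned by litellm.completion(..., stream=True) would be passed to collect_stream in place of fake_stream.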

Using an Azure fine-tuned model deployment

import litellm

response = litellm.completion(
    model="azure/my-fine-tuned-deployment",
    messages=[
        {"role": "user", "content": "Classify this support ticket."},
    ],
    api_base="https://my-resource.openai.azure.com/",
    api_key="my-azure-api-key",
    api_version="2024-02-01",
)

print(response.choices[0].message.content)
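Hard-coded credentials are convenient in an example but are usually read from the environment in practice. The following sketch assumes the variable names AZURE_API_KEY, AZURE_API_BASE, and AZURE_API_VERSION; azure_kwargs_from_env is a hypothetical helper, and the values shown are the placeholders from the example above.

```python
import os
from typing import Dict, Mapping

# Assumed environment variable names for each completion() argument.
_AZURE_ENV_VARS = {
    "api_key": "AZURE_API_KEY",
    "api_base": "AZURE_API_BASE",
    "api_version": "AZURE_API_VERSION",
}


def azure_kwargs_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read Azure connection settings from a mapping, failing loudly if any is missing."""
    missing = [name for name in _AZURE_ENV_VARS.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {param: env[name] for param, name in _AZURE_ENV_VARS.items()}


# Placeholder values; a real deployment would set these outside the code.
example_env = {
    "AZURE_API_KEY": "my-azure-api-key",
    "AZURE_API_BASE": "https://my-resource.openai.azure.com/",
    "AZURE_API_VERSION": "2024-02-01",
}
azure_kwargs = azure_kwargs_from_env(example_env)
```

The result can be unpacked into the call shown above, for example litellm.completion(model="azure/my-fine-tuned-deployment", messages=..., **azure_kwargs).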

End-to-end: from job completion to inference

import litellm
from litellm.fine_tuning.main import retrieve_fine_tuning_job

# Check that the job has completed
job = retrieve_fine_tuning_job(
    fine_tuning_job_id="ftjob-abc123",
    custom_llm_provider="openai",
)

if job.status == "succeeded" and job.fine_tuned_model:
    # Use the fine-tuned model for inference
    response = litellm.completion(
        model=job.fine_tuned_model,
        messages=[
            {"role": "user", "content": "Generate a product description."},
        ],
    )
    print(response.choices[0].message.content)
else:
    print(f"Job not ready. Status: {job.status}")
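If the job is still running, the check above can be repeated until the job reaches a terminal state. This is a minimal polling sketch, not LiteLLM behavior: wait_for_fine_tuned_model is a hypothetical helper, the retrieve function is injected, and the terminal status names other than "succeeded" ("failed", "cancelled") are assumed from OpenAI's job statuses.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable, Optional

# Assumed terminal statuses; only "succeeded" yields a usable model.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tuned_model(
    retrieve: Callable[[str], Any],
    job_id: str,
    poll_interval: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a job until it finishes; return the model id on success, else None."""
    for attempt in range(max_polls):
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job.fine_tuned_model if job.status == "succeeded" else None
        if attempt < max_polls - 1:
            sleep(poll_interval)
    return None  # gave up before the job reached a terminal status


# Offline demo: a stub that reports "running" once, then "succeeded".
_states = iter(
    [
        SimpleNamespace(status="running", fine_tuned_model=None),
        SimpleNamespace(
            status="succeeded",
            fine_tuned_model="ft:gpt-3.5-turbo:my-org:custom-suffix:abc123",
        ),
    ]
)
model_id = wait_for_fine_tuned_model(
    lambda job_id: next(_states), "ftjob-abc123", sleep=lambda seconds: None
)
```

In real use, retrieve could wrap retrieve_fine_tuning_job with the provider fixed, for example lambda job_id: retrieve_fine_tuning_job(fine_tuning_job_id=job_id, custom_llm_provider="openai").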

Related Pages

Principle:BerriAI_Litellm_Fine_Tuned_Model_Usage
