Implementation:BerriAI Litellm Completion With Fine Tuned Model

From Leeroopedia
Knowledge Sources: BerriAI/litellm
Domains: Model Inference, Fine-Tuned Model Deployment, API Integration
Last Updated: 2026-02-15

Overview

A concrete tool for running inference on fine-tuned models through LiteLLM's unified completion interface.

Description

LiteLLM's completion() function is the single entry point for invoking any supported language model, including fine-tuned ones. After a fine-tuning job succeeds, it produces a model identifier (available via job.fine_tuned_model), and that identifier is passed as the model parameter to completion(). The function resolves the provider, handles authentication, transforms the request into the provider-specific format, and returns a normalized ModelResponse object. Fine-tuned models need no separate API or special handling; they are called through exactly the same interface as base models.

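The completion() function supports the full range of OpenAI-compatible parameters, including temperature, top_p, max_tokens, streaming, function calling, tool use, logprobs, and response formatting. These parameters are passed the same way for fine-tuned models as for base models, although whether a given feature is actually available still depends on the provider and on the base model that was fine-tuned.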

Usage

Use completion() with a fine-tuned model when:

  • A fine-tuning job has completed and the fine_tuned_model field is available.
  • Integrating fine-tuned model inference into existing application code.
  • Comparing fine-tuned model output against base model output.
  • Deploying fine-tuned models in production through the LiteLLM proxy or direct API calls.
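The comparison use case in the list above can be sketched as a small helper. This is a minimal illustration, not part of LiteLLM: compare_models and its completion_fn parameter are hypothetical names, and the helper assumes the callable it receives behaves like litellm.completion (keyword arguments model and messages, and a response exposing choices[0].message.content).

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, str]


def compare_models(
    completion_fn: Callable[..., Any],
    base_model: str,
    fine_tuned_model: str,
    messages: List[Message],
    **params: Any,
) -> Dict[str, str]:
    """Send the same messages to both models and return each reply keyed by model name."""
    replies: Dict[str, str] = {}
    for model in (base_model, fine_tuned_model):
        # The call is identical for both models; only the model string changes.
        response = completion_fn(model=model, messages=messages, **params)
        replies[model] = response.choices[0].message.content
    return replies
```

In an application this would be called as compare_models(litellm.completion, "gpt-3.5-turbo", "ft:gpt-3.5-turbo:my-org:custom-suffix:abc123", messages, temperature=0). Passing the completion function in as an argument keeps the helper easy to exercise with a stub.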

Code Reference

Source Location

litellm/main.py (lines 999-7447)

Signature

def completion(
    model: str,
    messages: List = [],
    timeout: Optional[Union[float, str, httpx.Timeout]] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    n: Optional[int] = None,
    stream: Optional[bool] = None,
    stream_options: Optional[dict] = None,
    stop=None,
    max_completion_tokens: Optional[int] = None,
    max_tokens: Optional[int] = None,
    modalities: Optional[List[ChatCompletionModality]] = None,
    prediction: Optional[ChatCompletionPredictionContentParam] = None,
    audio: Optional[ChatCompletionAudioParam] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    logit_bias: Optional[dict] = None,
    user: Optional[str] = None,
    reasoning_effort: Optional[Literal[
        "none", "minimal", "low", "medium", "high", "xhigh", "default"
    ]] = None,
    response_format: Optional[Union[dict, Type[BaseModel]]] = None,
    seed: Optional[int] = None,
    tools: Optional[List] = None,
    tool_choice: Optional[Union[str, dict]] = None,
    logprobs: Optional[bool] = None,
    top_logprobs: Optional[int] = None,
    parallel_tool_calls: Optional[bool] = None,
    deployment_id=None,
    extra_headers: Optional[dict] = None,
    functions: Optional[List] = None,
    function_call: Optional[str] = None,
    base_url: Optional[str] = None,
    api_version: Optional[str] = None,
    api_key: Optional[str] = None,
    model_list: Optional[list] = None,
    **kwargs,
) -> Union[ModelResponse, CustomStreamWrapper]:

Import

import litellm
from litellm import completion

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | The model identifier. For fine-tuned models, use the identifier from the completed job's fine_tuned_model field (e.g., "ft:gpt-3.5-turbo:my-org:suffix:id"). For Azure, use "azure/deployment-name". For Vertex AI, use the tuned model resource path. |
| messages | List | Yes | A list of message dicts with "role" and "content" keys representing the conversation. |
| temperature | Optional[float] | No | Sampling temperature (0.0 to 2.0). Lower values produce more deterministic output. |
| top_p | Optional[float] | No | Nucleus sampling parameter. |
| max_tokens | Optional[int] | No | Maximum number of tokens in the completion. |
| max_completion_tokens | Optional[int] | No | Maximum number of completion tokens (newer API parameter). |
| stream | Optional[bool] | No | If True, returns a streaming response. |
| stop | various | No | Stop sequence(s) at which the model should stop generating. |
| tools | Optional[List] | No | List of tool/function definitions for function calling. |
| tool_choice | Optional[Union[str, dict]] | No | Controls which tool the model calls. |
| seed | Optional[int] | No | Seed for best-effort reproducible sampling. |
| api_key | Optional[str] | No | Override API key for this request. |
| base_url | Optional[str] | No | Override API base URL for this request. |
| **kwargs | various | No | Additional provider-specific parameters. |
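Before passing a model string to completion(), it can help to check that it has the expected shape. The sketch below splits an identifier laid out like the example in the table ("ft:gpt-3.5-turbo:my-org:suffix:id"). The five-field layout is an assumption taken from that example only; real identifiers may carry more or fewer fields, and parse_openai_fine_tuned_id and FineTunedModelId are hypothetical names, not LiteLLM APIs.

```python
from typing import NamedTuple, Optional


class FineTunedModelId(NamedTuple):
    base_model: str
    organization: str
    suffix: str
    model_id: str


def parse_openai_fine_tuned_id(model: str) -> Optional[FineTunedModelId]:
    """Split an 'ft:<base>:<org>:<suffix>:<id>' string; return None for any other shape."""
    parts = model.split(":")
    # Assumed layout: the literal prefix "ft" followed by exactly four fields.
    if len(parts) != 5 or parts[0] != "ft":
        return None
    _, base_model, organization, suffix, model_id = parts
    return FineTunedModelId(base_model, organization, suffix, model_id)
```

A None result only means the string does not match this particular layout, which is expected for values such as "azure/deployment-name".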

Outputs

| Return Type | Description |
|---|---|
| ModelResponse | A normalized response object containing: id, choices (list of completion choices with message containing role and content), created, model (the model identifier used), usage (prompt_tokens, completion_tokens, total_tokens). Identical structure regardless of whether the model is a base model or fine-tuned model. |
| CustomStreamWrapper | Returned when stream=True. An iterable yielding streamed completion chunks. |
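Because the response structure is the same for base and fine-tuned models, one accessor can serve both. The following sketch reads only the fields named in the table above; summarize_response is a hypothetical helper, and it assumes usage is populated on the response.

```python
from typing import Any, Dict


def summarize_response(response: Any) -> Dict[str, Any]:
    """Collect the first choice's text and the token counts from a ModelResponse-like object."""
    usage = response.usage
    return {
        "model": response.model,
        "text": response.choices[0].message.content,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }
```

The helper applies to non-streaming responses only; a streamed response has to be iterated chunk by chunk instead.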

Usage Examples

Basic completion with an OpenAI fine-tuned model

import litellm

# Use the fine-tuned model identifier from a completed job
response = litellm.completion(
    model="ft:gpt-3.5-turbo:my-org:custom-suffix:abc123",
    messages=[
        {"role": "system", "content": "You are a domain expert assistant."},
        {"role": "user", "content": "Explain the key findings from the Q3 report."},
    ],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Streaming completion with a fine-tuned model

import litellm

response = litellm.completion(
    model="ft:gpt-4o-mini-2024-07-18:my-org:specialist:xyz789",
    messages=[
        {"role": "user", "content": "Summarize the contract terms."},
    ],
    stream=True,
)

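# Each chunk carries an incremental delta; its content can be None, so skip empty deltas.
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)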
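When the full text is needed after streaming finishes (for logging or evaluation, say), the chunks can be accumulated rather than printed. This is a minimal sketch: collect_stream_text is a hypothetical helper, and the guard for chunks with an empty choices list is a defensive assumption rather than documented behavior.

```python
from typing import Any, Iterable, List


def collect_stream_text(chunks: Iterable[Any]) -> str:
    """Concatenate delta.content across all chunks, skipping chunks that carry no text."""
    pieces: List[str] = []
    for chunk in chunks:
        if not chunk.choices:
            # Defensive: tolerate chunks that carry no choices at all.
            continue
        content = chunk.choices[0].delta.content
        if content:
            pieces.append(content)
    return "".join(pieces)
```

With the streaming call shown above, full_text = collect_stream_text(response) would consume the stream and return the assembled reply.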

Using an Azure fine-tuned model deployment

import litellm

response = litellm.completion(
    model="azure/my-fine-tuned-deployment",
    messages=[
        {"role": "user", "content": "Classify this support ticket."},
    ],
    api_base="https://my-resource.openai.azure.com/",
    api_key="my-azure-api-key",
    api_version="2024-02-01",
)

print(response.choices[0].message.content)
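In the Azure example the key is written inline for brevity. A common alternative is to read the connection settings from the environment. The sketch below assumes the variable names AZURE_API_BASE, AZURE_API_KEY, and AZURE_API_VERSION; these names and the helper azure_kwargs_from_env are assumptions for illustration, so adjust them to match your deployment.

```python
import os
from typing import Dict, Mapping, Optional


def azure_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build api_base/api_key/api_version keyword arguments from environment variables."""
    env = os.environ if env is None else env
    # Assumed variable names; change them if your setup uses different ones.
    names = {
        "api_base": "AZURE_API_BASE",
        "api_key": "AZURE_API_KEY",
        "api_version": "AZURE_API_VERSION",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: env[var] for kwarg, var in names.items()}
```

The result can be unpacked into the call, for example litellm.completion(model="azure/my-fine-tuned-deployment", messages=messages, **azure_kwargs_from_env()).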

End-to-end: from job completion to inference

import litellm
from litellm.fine_tuning.main import retrieve_fine_tuning_job

# Check that the job has completed
job = retrieve_fine_tuning_job(
    fine_tuning_job_id="ftjob-abc123",
    custom_llm_provider="openai",
)

if job.status == "succeeded" and job.fine_tuned_model:
    # Use the fine-tuned model for inference
    response = litellm.completion(
        model=job.fine_tuned_model,
        messages=[
            {"role": "user", "content": "Generate a product description."},
        ],
    )
    print(response.choices[0].message.content)
else:
    print(f"Job not ready. Status: {job.status}")
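The end-to-end example checks the job once. If the job may still be running, the check can be wrapped in a polling loop, sketched below. wait_for_fine_tuned_model is a hypothetical helper; it assumes the retrieval function takes the same keyword arguments as in the example above and that the job reports OpenAI-style status strings ("succeeded", "failed", "cancelled"). Other providers may use different values.

```python
import time
from typing import Any, Callable

# Assumed OpenAI-style terminal statuses.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tuned_model(
    retrieve_fn: Callable[..., Any],
    job_id: str,
    provider: str = "openai",
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the job until it reaches a terminal status; return the fine-tuned model id."""
    for _ in range(max_polls):
        job = retrieve_fn(fine_tuning_job_id=job_id, custom_llm_provider=provider)
        if job.status == "succeeded" and job.fine_tuned_model:
            return job.fine_tuned_model
        if job.status in TERMINAL_STATUSES:
            # Failed, cancelled, or succeeded without a usable model identifier.
            raise RuntimeError(f"Job {job_id} ended with status {job.status!r}")
        sleep_fn(poll_seconds)
    raise TimeoutError(f"Job {job_id} not finished after {max_polls} polls")
```

Called as model_id = wait_for_fine_tuned_model(retrieve_fine_tuning_job, "ftjob-abc123"), the returned identifier can be passed straight to litellm.completion(). The poll interval and the limit on attempts are arbitrary defaults chosen for the sketch.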
