Implementation:BerriAI Litellm Completion With Fine Tuned Model
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| BerriAI/litellm | Model Inference, Fine-Tuned Model Deployment, API Integration | 2026-02-15 |
Overview
Concrete tool for running inference on fine-tuned models through the standard unified completion interface provided by LiteLLM.
Description
LiteLLM's completion() function serves as the universal entry point for invoking any supported language model, including fine-tuned models. After a fine-tuning job succeeds and produces a model identifier (available via job.fine_tuned_model), that identifier is passed as the model parameter to completion(). The function automatically resolves the provider, handles authentication, transforms the request to the provider-specific format, and returns a normalized ModelResponse object. No separate API or special handling is required for fine-tuned models: they are used through exactly the same interface as base models.
The completion() function supports the full range of OpenAI-compatible parameters including temperature, top_p, max_tokens, streaming, function calling, tool use, logprobs, and response formatting. These parameters are passed the same way for fine-tuned models as for base models; whether a given feature is honored depends on the provider and on the base model that was fine-tuned.
Usage
Use completion() with a fine-tuned model when:
- A fine-tuning job has completed and the fine_tuned_model field is available.
- Integrating fine-tuned model inference into existing application code.
- Comparing fine-tuned model output against base model output.
- Deploying fine-tuned models in production through the LiteLLM proxy or direct API calls.
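The comparison use case above can be sketched as a small helper that sends the same messages to a base model and to its fine-tuned counterpart. This is an illustrative sketch, not part of LiteLLM: the name compare_models and the injectable complete argument are assumptions made here so the helper can be exercised with a stub; when complete is omitted it falls back to litellm.completion.

```python
from typing import Any, Callable, Dict, List, Optional


def compare_models(
    messages: List[Dict[str, str]],
    base_model: str,
    fine_tuned_model: str,
    complete: Optional[Callable[..., Any]] = None,
    **params: Any,
) -> Dict[str, str]:
    """Send identical messages to both models and return their text outputs.

    compare_models is a hypothetical helper, not a LiteLLM function.
    """
    if complete is None:
        # Imported lazily so the helper can also be run with a stub callable.
        from litellm import completion as complete

    outputs: Dict[str, str] = {}
    for label, model in (("base", base_model), ("fine_tuned", fine_tuned_model)):
        # Same messages and sampling parameters for both calls, so any
        # difference in output comes from the model itself.
        response = complete(model=model, messages=messages, **params)
        outputs[label] = response.choices[0].message.content
    return outputs
```

Passing a low temperature (for example temperature=0) to both calls makes the two outputs easier to compare, since less of the difference is due to sampling noise.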
Code Reference
Source Location
litellm/main.py (lines 999-7447)
Signature
def completion(
    model: str,
    messages: List = [],
    timeout: Optional[Union[float, str, httpx.Timeout]] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    n: Optional[int] = None,
    stream: Optional[bool] = None,
    stream_options: Optional[dict] = None,
    stop=None,
    max_completion_tokens: Optional[int] = None,
    max_tokens: Optional[int] = None,
    modalities: Optional[List[ChatCompletionModality]] = None,
    prediction: Optional[ChatCompletionPredictionContentParam] = None,
    audio: Optional[ChatCompletionAudioParam] = None,
    presence_penalty: Optional[float] = None,
    frequency_penalty: Optional[float] = None,
    logit_bias: Optional[dict] = None,
    user: Optional[str] = None,
    reasoning_effort: Optional[Literal[
        "none", "minimal", "low", "medium", "high", "xhigh", "default"
    ]] = None,
    response_format: Optional[Union[dict, Type[BaseModel]]] = None,
    seed: Optional[int] = None,
    tools: Optional[List] = None,
    tool_choice: Optional[Union[str, dict]] = None,
    logprobs: Optional[bool] = None,
    top_logprobs: Optional[int] = None,
    parallel_tool_calls: Optional[bool] = None,
    deployment_id=None,
    extra_headers: Optional[dict] = None,
    functions: Optional[List] = None,
    function_call: Optional[str] = None,
    base_url: Optional[str] = None,
    api_version: Optional[str] = None,
    api_key: Optional[str] = None,
    model_list: Optional[list] = None,
    **kwargs,
) -> Union[ModelResponse, CustomStreamWrapper]:
Import
import litellm
from litellm import completion
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | The model identifier. For fine-tuned models, use the identifier from the completed job's fine_tuned_model field (e.g., "ft:gpt-3.5-turbo:my-org:suffix:id"). For Azure, use "azure/deployment-name". For Vertex AI, use the tuned model resource path. |
| messages | List | Yes | A list of message dicts with "role" and "content" keys representing the conversation. |
| temperature | Optional[float] | No | Sampling temperature (0.0 to 2.0). Lower values produce more deterministic output. |
| top_p | Optional[float] | No | Nucleus sampling parameter. |
| max_tokens | Optional[int] | No | Maximum number of tokens in the completion. |
| max_completion_tokens | Optional[int] | No | Maximum number of completion tokens (newer API parameter). |
| stream | Optional[bool] | No | If True, returns a streaming response. |
| stop | various | No | Stop sequence(s) where the model should stop generating. |
| tools | Optional[List] | No | List of tool/function definitions for function calling. |
| tool_choice | Optional[Union[str, dict]] | No | Controls which tool the model calls. |
| seed | Optional[int] | No | Seed for deterministic output. |
| api_key | Optional[str] | No | Override API key for this request. |
| base_url | Optional[str] | No | Override API base URL for this request. |
| **kwargs | various | No | Additional provider-specific parameters. |
Outputs
| Return Type | Description |
|---|---|
| ModelResponse | A normalized response object containing: id, choices (list of completion choices with message containing role and content), created, model (the model identifier used), usage (prompt_tokens, completion_tokens, total_tokens). Identical structure regardless of whether the model is a base model or fine-tuned model. |
| CustomStreamWrapper | Returned when stream=True. An iterable yielding streamed completion chunks. |
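Because the response structure is the same for base and fine-tuned models, a single accessor can serve both. The sketch below reads only the fields listed in the table above; summarize_response is a hypothetical helper name introduced here for illustration, not a LiteLLM function.

```python
from typing import Any, Dict


def summarize_response(response: Any) -> Dict[str, Any]:
    """Flatten the commonly used fields of a non-streaming response.

    Hypothetical helper: it relies only on the attributes described in the
    Outputs table (id, model, choices, usage).
    """
    first_choice = response.choices[0]
    return {
        "id": response.id,
        "model": response.model,
        "role": first_choice.message.role,
        "content": first_choice.message.content,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
    }
```

The model field in the summary is useful for logging, since it records which fine-tuned identifier actually served the request.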
Usage Examples
Basic completion with an OpenAI fine-tuned model
import litellm

# Use the fine-tuned model identifier from a completed job
response = litellm.completion(
    model="ft:gpt-3.5-turbo:my-org:custom-suffix:abc123",
    messages=[
        {"role": "system", "content": "You are a domain expert assistant."},
        {"role": "user", "content": "Explain the key findings from the Q3 report."},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
Streaming completion with a fine-tuned model
import litellm

response = litellm.completion(
    model="ft:gpt-4o-mini-2024-07-18:my-org:specialist:xyz789",
    messages=[
        {"role": "user", "content": "Summarize the contract terms."},
    ],
    stream=True,
)
# Each chunk carries an incremental delta; content may be None on some chunks
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
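When the full text is needed after streaming (for logging or post-processing), the chunks can be accumulated instead of printed. The following is a minimal sketch; collect_stream is a hypothetical helper name, and the guard for chunks with an empty choices list is a defensive assumption, since some providers emit such chunks (for example, ones that carry only usage data).

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Join the delta.content pieces of a streamed completion into one string.

    Hypothetical helper; works on any iterable of chunk-like objects.
    """
    parts = []
    for chunk in chunks:
        # Defensive: skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        # content can be None or empty, e.g. on the final chunk.
        if piece:
            parts.append(piece)
    return "".join(parts)
```

With a live call, the iterable returned by litellm.completion(..., stream=True) would be passed directly as chunks.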
Using an Azure fine-tuned model deployment
import litellm

response = litellm.completion(
    model="azure/my-fine-tuned-deployment",
    messages=[
        {"role": "user", "content": "Classify this support ticket."},
    ],
    api_base="https://my-resource.openai.azure.com/",
    api_key="my-azure-api-key",
    api_version="2024-02-01",
)
print(response.choices[0].message.content)
End-to-end: from job completion to inference
import litellm
from litellm.fine_tuning.main import retrieve_fine_tuning_job

# Check that the job has completed
job = retrieve_fine_tuning_job(
    fine_tuning_job_id="ftjob-abc123",
    custom_llm_provider="openai",
)
if job.status == "succeeded" and job.fine_tuned_model:
    # Use the fine-tuned model for inference
    response = litellm.completion(
        model=job.fine_tuned_model,
        messages=[
            {"role": "user", "content": "Generate a product description."},
        ],
    )
    print(response.choices[0].message.content)
else:
    print(f"Job not ready. Status: {job.status}")
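The example above checks the job once. If the job may still be running, the same check can be wrapped in a polling loop. The sketch below is an assumption-laden illustration: wait_for_fine_tuned_model, its retrieve_job callable, and the polling defaults are hypothetical and not part of LiteLLM; the terminal statuses "failed" and "cancelled" follow the OpenAI fine-tuning job status values and may differ for other providers.

```python
import time
from typing import Any, Callable

# Assumed terminal failure statuses (OpenAI-style); other providers may differ.
TERMINAL_FAILURES = {"failed", "cancelled"}


def wait_for_fine_tuned_model(
    retrieve_job: Callable[[], Any],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it yields a fine-tuned model identifier.

    Hypothetical helper: retrieve_job is any zero-argument callable that
    returns an object with status and fine_tuned_model attributes.
    """
    for _ in range(max_polls):
        job = retrieve_job()
        if job.status == "succeeded" and job.fine_tuned_model:
            return job.fine_tuned_model
        if job.status in TERMINAL_FAILURES:
            raise RuntimeError(f"Fine-tuning job ended with status: {job.status}")
        # Still in progress: wait before asking again.
        sleep(poll_seconds)
    raise TimeoutError(f"Job did not finish within {max_polls} polls")
```

To use it with the retrieval call shown above, retrieve_job could be lambda: retrieve_fine_tuning_job(fine_tuning_job_id="ftjob-abc123", custom_llm_provider="openai"), and the returned identifier is then passed as model to litellm.completion().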