Principle:Openai Openai python Embedding Input Preparation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Embeddings |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A text preprocessing pattern that formats input strings or token arrays for embedding model consumption within token length constraints.
Description
Embedding input preparation means shaping text into one of the input types the Embeddings API accepts: a single string, a list of strings (a batch), a pre-tokenized array of integers, or a batch of such arrays. Every individual input must stay within the model's token limit, listed here as 8192 tokens for text-embedding-ada-002 and 8191 for text-embedding-3-* models. Confirm the exact value against the current documentation for the model in use.
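The four input shapes can be told apart with a small check before a request is made. The following is a minimal sketch, not part of the openai library: the function name `classify_embedding_input` and its return labels are hypothetical, and the rejection of empty strings and empty lists reflects the API reference as understood here rather than anything stated on this page.

```python
from typing import Any


def _is_token(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def classify_embedding_input(value: Any) -> str:
    """Return which of the four input shapes `value` matches.

    Hypothetical helper for illustration; raises ValueError for anything else.
    """
    if isinstance(value, str):
        if not value:
            raise ValueError("input string must not be empty")
        return "string"
    if isinstance(value, list) and value:
        # A batch of non-empty strings.
        if all(isinstance(item, str) and item for item in value):
            return "string_batch"
        # A single pre-tokenized input: a flat list of integers.
        if all(_is_token(item) for item in value):
            return "tokens"
        # A batch of pre-tokenized inputs: a list of non-empty integer lists.
        if all(
            isinstance(item, list) and item and all(_is_token(t) for t in item)
            for item in value
        ):
            return "token_batch"
    raise ValueError("unsupported embedding input shape")
```

A check like this catches mixed lists (for example strings and integers together) locally instead of relying on an error response from the API.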
Usage
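Apply this principle whenever text is prepared for embedding generation. Check that each input fits within the model's token limit, truncating or splitting it if it does not, and send multiple texts in one request instead of one request per text to reduce round trips.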
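Batching can be sketched as a generator that slices a list of texts into fixed-size groups. The name `batch_inputs` and the default `batch_size` of 100 are assumptions chosen for illustration; the API also caps how many inputs a single request may carry, so the batch size should be picked with the current API reference in mind.

```python
from typing import Iterator, List, Sequence


def batch_inputs(texts: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items.

    Hypothetical helper; order is preserved so results can be matched back
    to the original texts by position.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


if __name__ == "__main__":
    docs = ["Text 1", "Text 2", "Text 3", "Text 4", "Text 5"]
    for batch in batch_inputs(docs, batch_size=2):
        print(batch)
```

Each yielded batch would then be passed as the `input` argument of `client.embeddings.create(...)`. The items in the response's `data` list carry an `index` field, which can be used to line embeddings up with the texts in that batch.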
Theoretical Basis
# Input formats
input_single = "A single text string"
input_batch = ["Text 1", "Text 2", "Text 3"]
input_tokens = [15496, 1871, 995] # Pre-tokenized
input_token_batch = [[15496, 1871], [995, 1234]]
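# Token limit enforcement
# count_tokens and truncate_to_tokens are placeholder helpers, not library functions
MAX_TOKENS = 8191  # set to the limit of the model in use
if count_tokens(text) > MAX_TOKENS:
    text = truncate_to_tokens(text, MAX_TOKENS)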
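Because the API accepts pre-tokenized input, the enforcement step can operate on the token array itself and skip decoding back to text. The sketch below assumes this approach; `truncate_tokens`, `prepare_input`, and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in that emits one "token" per character so the example runs without dependencies. In practice the `encode` argument would be a real tokenizer matching the model, for example `tiktoken.get_encoding("cl100k_base").encode` (an assumption about which encoding applies).

```python
from typing import Callable, List

MAX_TOKENS = 8191  # limit quoted in this document; confirm for the model in use


def truncate_tokens(tokens: List[int], max_tokens: int = MAX_TOKENS) -> List[int]:
    """Keep only the first `max_tokens` tokens."""
    return tokens[:max_tokens]


def prepare_input(
    text: str,
    encode: Callable[[str], List[int]],
    max_tokens: int = MAX_TOKENS,
) -> List[int]:
    """Tokenize `text` and truncate the result to the token limit."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("text produced no tokens")
    return truncate_tokens(tokens, max_tokens)


def toy_encode(text: str) -> List[int]:
    # Stand-in tokenizer for demonstration only: one token per character.
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    print(prepare_input("hello world", toy_encode, max_tokens=5))
```

Truncation discards everything past the limit. When the tail of a document matters, splitting the token array into several chunks and embedding each one is an alternative, at the cost of having to combine or store multiple vectors per document.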