Heuristic: Cohere Python SDK Embed Auto-Batching Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Text_Embedding |
| Last Updated | 2026-02-15 14:00 GMT |
Overview
The SDK automatically splits large embedding requests into batches of 96 texts and processes them in parallel using a 64-thread pool, then merges results transparently.
Description
When calling `Client.embed()` with more than 96 text inputs, the SDK automatically divides the input into chunks of `embed_batch_size` (96) and dispatches them concurrently via a `ThreadPoolExecutor` with 64 worker threads. The individual responses are then merged back into a single `EmbedResponse` using `merge_embed_responses()`. This batching is enabled by default (`batching=True`) but can be disabled. Image embeddings bypass batching entirely.
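To make the chunking concrete, here is a minimal sketch (not the SDK source) of how the slicing described above divides a hypothetical list of 300 texts using the documented `embed_batch_size` of 96:

```python
# Sketch of the batch-splitting step described above; `texts` is illustrative.
embed_batch_size = 96  # constant documented in config.py

texts = [f"doc {i}" for i in range(300)]  # hypothetical input of 300 texts

# Same slicing pattern the SDK uses: contiguous 96-item chunks.
texts_batches = [
    texts[i : i + embed_batch_size]
    for i in range(0, len(texts), embed_batch_size)
]

print([len(b) for b in texts_batches])  # → [96, 96, 96, 12]
```

The last chunk is simply whatever remains, so no padding is involved and the merged response has exactly one embedding per input text.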
Usage
Apply this knowledge when embedding large text collections (>96 items) through the Cohere SDK. The batching is automatic and requires no configuration. Be aware of it when:
- Debugging rate limit errors on large embed calls (each batch is a separate API request)
- Tuning `thread_pool_executor` size for your deployment
- Working with image embeddings (batching is skipped for images)
The Insight (Rule of Thumb)
- Action: Let the SDK handle batching automatically; adjust `thread_pool_executor` worker count if needed.
- Value: `embed_batch_size = 96` texts per batch, `ThreadPoolExecutor(64)` default workers.
- Trade-off: Parallel batching improves throughput for large inputs but creates multiple API requests, each counting toward rate limits. Disabling batching (`batching=False`) sends one large request.
- Exception: Image embeddings are never batched (`if images is not OMIT` skips batching).
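The rate-limit side of this trade-off is easy to quantify: with batching enabled, one `embed()` call fans out into roughly ceil(n / 96) API requests. A small sketch (the helper name is hypothetical, not part of the SDK):

```python
import math

EMBED_BATCH_SIZE = 96  # SDK default batch size

def api_requests_per_call(num_texts: int, batching: bool = True) -> int:
    """Hypothetical helper: how many underlying API requests one
    embed() call generates, per the batching behavior described above."""
    if not batching or num_texts == 0:
        return 1  # a single (possibly large) request
    return math.ceil(num_texts / EMBED_BATCH_SIZE)

print(api_requests_per_call(1000))                  # → 11
print(api_requests_per_call(1000, batching=False))  # → 1
```

So a 1,000-text call consumes 11 request slots against your rate limit with batching on, versus one larger request with `batching=False`.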
Reasoning
The Cohere embed API has per-request limits on input size. By splitting into 96-item batches and parallelizing with 64 threads, the SDK maximizes throughput while staying within API constraints. The batch size of 96 was chosen as a tuned value balancing request overhead against payload size. The 64-thread pool allows high concurrency for I/O-bound HTTP requests.
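The split-dispatch-merge pipeline described above can be sketched end to end with a stub in place of the HTTP call (all names here are illustrative, assuming only the behavior documented in this card):

```python
from concurrent.futures import ThreadPoolExecutor

EMBED_BATCH_SIZE = 96

def fake_embed(batch):
    # Stand-in for the per-batch API request; returns one "vector" per text.
    return [[float(len(text))] for text in batch]

def embed_with_batching(texts, max_workers=64):
    # 1. Split into contiguous 96-item chunks.
    batches = [
        texts[i : i + EMBED_BATCH_SIZE]
        for i in range(0, len(texts), EMBED_BATCH_SIZE)
    ]
    # 2. Dispatch chunks concurrently; executor.map preserves input
    #    order, so results line up with the original texts.
    with ThreadPoolExecutor(max_workers) as pool:
        responses = list(pool.map(fake_embed, batches))
    # 3. "Merge": flatten per-batch results back into one list.
    return [vec for resp in responses for vec in resp]

embeddings = embed_with_batching([f"t{i}" for i in range(200)])
print(len(embeddings))  # → 200
```

Because the threads only wait on I/O in the real SDK, a large worker pool is cheap; the ordering guarantee of `Executor.map` is what lets the merge step stay a simple concatenation.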
Code Evidence
Batch size constant from `config.py:1`:
```python
embed_batch_size = 96
```
ThreadPoolExecutor default from `client.py:143`:
```python
thread_pool_executor: ThreadPoolExecutor = ThreadPoolExecutor(64),
```
Auto-batching logic from `client.py:192-224`:
```python
def embed(self, *, texts=..., images=..., batching=True, ...) -> EmbedResponse:
    # skip batching for images for now
    if batching is False or images is not OMIT:
        return BaseCohere.embed(self, texts=texts, ...)
    textsarr = texts if texts is not OMIT and texts is not None else []
    texts_batches = [
        textsarr[i : i + embed_batch_size]
        for i in range(0, len(textsarr), embed_batch_size)
    ]
    responses = [
        response
        for response in self._executor.map(
            lambda text_batch: BaseCohere.embed(self, texts=text_batch, ...),
            texts_batches,
        )
    ]
    return merge_embed_responses(responses)
```