Heuristic: TorchServe CPU Performance Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, CPU_Inference |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
CPU inference optimization via core pinning (10-30% improvement), ONNX Runtime threading, `torch.inference_mode()`, and sequence bucketing for NLP.
Description
CPU inference performance in TorchServe can be significantly improved through several techniques. Core pinning (`cpu_launcher_enable=true`) prevents thread migration across CPU cores and NUMA nodes, dramatically reducing cache misses. ONNX Runtime sessions should set `intra_op_num_threads` to the logical CPU count. Always use `torch.inference_mode()` instead of `torch.no_grad()` for inference, as it provides additional optimizations. For NLP models with variable-length sequences, avoid over-padding to constant length (e.g., 512 tokens) and use sequence bucketing to reduce wasted computation.
Usage
Apply this heuristic when running TorchServe on CPU-only deployments or when CPU pre/post-processing is a bottleneck. Particularly important for multi-socket NUMA servers where cross-socket memory access can halve performance.
The Insight (Rule of Thumb)
- Core Pinning: Set `cpu_launcher_enable=true` and `cpu_launcher_args=--use_logical_core` in `config.properties`. Provides 10-30% improvement on NUMA systems.
- ONNX Runtime: Set `intra_op_num_threads` to `psutil.cpu_count(logical=True)` for CPU sessions.
- torch.inference_mode(): Use instead of `torch.no_grad()`. Disables view tracking and version counter bumps.
- Avoid over-padding: tokenizers that pad every sequence to a constant max length (e.g., 512 tokens) can run orders of magnitude slower on short inputs.
- Sequence bucketing: Sort inputs by length before batching. Can improve throughput by 2x.
- INT8 quantization: Consider for CPU inference with acceptable accuracy loss. Use Intel Neural Compressor for advanced quantization.
- Trade-off: Core pinning reduces flexibility for other processes; quantization may reduce accuracy.
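The INT8 bullet above can be sketched with PyTorch's built-in dynamic quantization (a minimal example; the model is an illustrative stand-in, and Intel Neural Compressor offers finer-grained control such as accuracy-aware tuning):

```python
import torch
import torch.nn as nn

# A small illustrative model; a real deployment would load a trained checkpoint.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization converts Linear weights to INT8 ahead of time;
# activations are quantized on the fly, so no calibration dataset is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = quantized(torch.randn(4, 128))
```

Dynamic quantization is the lowest-effort option for CPU inference; validate accuracy on a held-out set before shipping, since the trade-off bullet above applies.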
Reasoning
Modern multi-socket servers have Non-Uniform Memory Access (NUMA) architecture where each CPU socket has its own local memory. When a thread on Socket 0 accesses memory owned by Socket 1, the access takes 2-3x longer. Core pinning ensures all threads for a model worker stay on the same socket, accessing only local memory.
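The underlying mechanism is the Linux CPU-affinity API, which tools like `taskset` (used by launcher scripts) wrap. A minimal sketch, assuming a Linux host; the "first four cores belong to socket 0" choice is an assumption about your topology, not something the API guarantees:

```python
import os

# Cores this worker process (pid 0 = current process) may currently run on.
available = os.sched_getaffinity(0)

# Illustrative stand-in for "socket 0's cores": the first four available IDs.
# On a real NUMA box, read the topology from /sys or `lscpu` instead.
socket0_cores = set(sorted(available)[:4])

# Pin the process: all of its threads now stay on these cores, so the memory
# they allocate stays local to one socket and remote-NUMA access is avoided.
os.sched_setaffinity(0, socket0_cores)
print(os.sched_getaffinity(0))
```

TorchServe's `cpu_launcher_enable` does this (plus NUMA-local memory binding) for each worker automatically, so handler code rarely needs to call the API directly.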
Hyperthreading doubles the number of "logical" cores, but each pair of siblings shares one physical core's execution units. For compute-bound inference, two threads on the same physical core compete for those resources. The `--use_logical_core` launcher flag opts the worker into using all logical cores; if hyperthread contention hurts your model, benchmark against pinning one thread per physical core instead.
`torch.inference_mode()` is strictly superior to `torch.no_grad()` for inference because it additionally disables autograd's internal bookkeeping (version counters, view tracking) that has no purpose during inference.
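A quick sketch of the difference (the tensor op is illustrative):

```python
import torch

x = torch.randn(8, 16, requires_grad=True)

# Under no_grad, gradients are not recorded, but autograd still maintains
# version counters and view metadata on the results.
with torch.no_grad():
    y_ng = x * 2

# Under inference_mode, that bookkeeping is skipped entirely; results are
# "inference tensors" that can never participate in a later backward pass.
with torch.inference_mode():
    y_im = x * 2

print(y_im.is_inference())  # True; y_ng.is_inference() is False
```

The trade-off is that inference tensors cannot be saved for autograd later, which is exactly why the mode is safe only for pure inference paths.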
For NLP models, padding all sequences to max_length (commonly 512) means a 10-token input wastes 98% of computation on padding tokens. Sorting by length and batching similar-length sequences together minimizes this waste.
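The waste is easy to quantify, and the bucketing idea is a few lines; the token-ID lists below are hypothetical stand-ins for real tokenizer output, with per-sequence cost modeled as padded length:

```python
# Hypothetical variable-length token sequences (e.g., tokenizer output).
seqs = [[1] * n for n in (10, 480, 12, 500, 8, 256)]

# Padding everything to a fixed 512 spends most of the compute on padding:
# a 10-token input does 512 columns of work, ~98% of it wasted.
fixed_cost = len(seqs) * 512

# Sequence bucketing: sort by length, then pad each batch only to its own max.
seqs_sorted = sorted(seqs, key=len)
batch_size = 2
bucketed_cost = 0
for i in range(0, len(seqs_sorted), batch_size):
    batch = seqs_sorted[i:i + batch_size]
    bucketed_cost += len(batch) * max(len(s) for s in batch)

print(fixed_cost, bucketed_cost)  # bucketing does a fraction of the work
```

One caveat: sorting by length reorders the batch, so responses must be mapped back to their original request order before returning.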
Code Evidence
Core pinning configuration from `docs/performance_guide.md:65-68`:

```
cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core
```
ONNX Runtime threading from `ts/torch_handler/base_handler.py:108-109`:

```python
import onnxruntime as ort
import psutil

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)
```
torch.inference_mode usage from `docs/performance_checklist.md:28`:
Using with torch.inference_mode() context before calling forward pass
on your model improves inference performance. This is achieved by
disabling view tracking and version counter bumps.
Sequence bucketing advice from `docs/performance_checklist.md:42`:
For batch processing on sequences with different lengths, sequence
bucketing could potentially improve the throughput by 2X. A simple
implementation is to sort all input by sequence length before feeding
them to the model, as this reduces unnecessary padding.