Heuristic: TorchServe CPU Performance Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, CPU_Inference |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
CPU inference optimization via core pinning (10-30% improvement), ONNX Runtime threading, `torch.inference_mode()`, and sequence bucketing for NLP.
Description
CPU inference performance in TorchServe can be significantly improved through several techniques. Core pinning (`cpu_launcher_enable=true`) prevents thread migration across CPU cores and NUMA nodes, dramatically reducing cache misses. ONNX Runtime sessions should set `intra_op_num_threads` to the logical CPU count. Always use `torch.inference_mode()` instead of `torch.no_grad()` for inference, as it provides additional optimizations. For NLP models with variable-length sequences, avoid over-padding to constant length (e.g., 512 tokens) and use sequence bucketing to reduce wasted computation.
Usage
Apply this heuristic when running TorchServe on CPU-only deployments or when CPU pre/post-processing is a bottleneck. Particularly important for multi-socket NUMA servers where cross-socket memory access can halve performance.
The Insight (Rule of Thumb)
- Core Pinning: Set `cpu_launcher_enable=true` and `cpu_launcher_args=--use_logical_core` in `config.properties`. Provides 10-30% improvement on NUMA systems.
- ONNX Runtime: Set `intra_op_num_threads` to `psutil.cpu_count(logical=True)` for CPU sessions.
- torch.inference_mode(): Use instead of `torch.no_grad()`. Disables view tracking and version counter bumps.
- Avoid over-padding: tokenizers that pad every sequence to a constant max length (e.g., 512 tokens) can run orders of magnitude slower on short inputs.
- Sequence bucketing: Sort inputs by length before batching. Can improve throughput by 2x.
- INT8 quantization: Consider for CPU inference with acceptable accuracy loss. Use Intel Neural Compressor for advanced quantization.
- Trade-off: Core pinning reduces flexibility for other processes; quantization may reduce accuracy.
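The INT8 bullet above can be sketched with PyTorch's built-in dynamic quantization (a minimal example; the model is an illustrative stand-in, and Intel Neural Compressor offers finer-grained control such as accuracy-aware tuning):

```python
import torch
import torch.nn as nn

# A small illustrative model; a real deployment would load a trained checkpoint.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization converts Linear weights to INT8 ahead of time;
# activations are quantized on the fly, so no calibration dataset is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = quantized(torch.randn(4, 128))
```

Dynamic quantization is the lowest-effort option for CPU inference; validate accuracy on a held-out set before shipping, since the trade-off bullet above applies.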
Reasoning
Modern multi-socket servers have Non-Uniform Memory Access (NUMA) architecture where each CPU socket has its own local memory. When a thread on Socket 0 accesses memory owned by Socket 1, the access takes 2-3x longer. Core pinning ensures all threads for a model worker stay on the same socket, accessing only local memory.
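The underlying mechanism is the Linux CPU-affinity API, which tools like `taskset` (used by launcher scripts) wrap. A minimal sketch, assuming a Linux host; the "first four cores belong to socket 0" choice is an assumption about your topology, not something the API guarantees:

```python
import os

# Cores this worker process (pid 0 = current process) may currently run on.
available = os.sched_getaffinity(0)

# Illustrative stand-in for "socket 0's cores": the first four available IDs.
# On a real NUMA box, read the topology from /sys or `lscpu` instead.
socket0_cores = set(sorted(available)[:4])

# Pin the process: all of its threads now stay on these cores, so the memory
# they allocate stays local to one socket and remote-NUMA access is avoided.
os.sched_setaffinity(0, socket0_cores)
print(os.sched_getaffinity(0))
```

TorchServe's `cpu_launcher_enable` does this (plus NUMA-local memory binding) for each worker automatically, so handler code rarely needs to call the API directly.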
Hyperthreading doubles the number of "logical" cores, but each pair of siblings shares one physical core's execution units. For compute-bound inference, two threads on the same physical core compete for those resources. The `--use_logical_core` launcher flag opts the worker into using all logical cores; if hyperthread contention hurts your model, benchmark against pinning one thread per physical core instead.
`torch.inference_mode()` is strictly superior to `torch.no_grad()` for inference because it additionally disables autograd's internal bookkeeping (version counters, view tracking) that has no purpose during inference.
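A quick sketch of the difference (the tensor op is illustrative):

```python
import torch

x = torch.randn(8, 16, requires_grad=True)

# Under no_grad, gradients are not recorded, but autograd still maintains
# version counters and view metadata on the results.
with torch.no_grad():
    y_ng = x * 2

# Under inference_mode, that bookkeeping is skipped entirely; results are
# "inference tensors" that can never participate in a later backward pass.
with torch.inference_mode():
    y_im = x * 2

print(y_im.is_inference())  # True; y_ng.is_inference() is False
```

The trade-off is that inference tensors cannot be saved for autograd later, which is exactly why the mode is safe only for pure inference paths.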
For NLP models, padding all sequences to max_length (commonly 512) means a 10-token input wastes 98% of computation on padding tokens. Sorting by length and batching similar-length sequences together minimizes this waste.
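The waste is easy to quantify, and the bucketing idea is a few lines; the token-ID lists below are hypothetical stand-ins for real tokenizer output, with per-sequence cost modeled as padded length:

```python
# Hypothetical variable-length token sequences (e.g., tokenizer output).
seqs = [[1] * n for n in (10, 480, 12, 500, 8, 256)]

# Padding everything to a fixed 512 spends most of the compute on padding:
# a 10-token input does 512 columns of work, ~98% of it wasted.
fixed_cost = len(seqs) * 512

# Sequence bucketing: sort by length, then pad each batch only to its own max.
seqs_sorted = sorted(seqs, key=len)
batch_size = 2
bucketed_cost = 0
for i in range(0, len(seqs_sorted), batch_size):
    batch = seqs_sorted[i:i + batch_size]
    bucketed_cost += len(batch) * max(len(s) for s in batch)

print(fixed_cost, bucketed_cost)  # bucketing does a fraction of the work
```

One caveat: sorting by length reorders the batch, so responses must be mapped back to their original request order before returning.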
Code Evidence
Core pinning configuration from `docs/performance_guide.md:65-68`:

```
cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core
```
ONNX Runtime threading from `ts/torch_handler/base_handler.py:108-109`:

```python
import onnxruntime as ort
import psutil

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)
```
torch.inference_mode usage from `docs/performance_checklist.md:28`:
Using with torch.inference_mode() context before calling forward pass
on your model improves inference performance. This is achieved by
disabling view tracking and version counter bumps.
Sequence bucketing advice from `docs/performance_checklist.md:42`:
For batch processing on sequences with different lengths, sequence
bucketing could potentially improve the throughput by 2X. A simple
implementation is to sort all input by sequence length before feeding
them to the model, as this reduces unnecessary padding.