Workflow:LaurentMazare Tch rs LLM Text Generation

Knowledge Sources	tch-rs, PyTorch C++ API
Domains	Deep_Learning, LLMs, Text_Generation, Rust_ML
Last Updated	2026-02-08 13:00 GMT

Overview

End-to-end process for running autoregressive text generation with a LLaMA language model implemented natively in Rust using tch-rs.

Description

This workflow demonstrates how to implement and run a large language model (LLaMA) entirely in Rust for text generation. It covers defining the full transformer architecture (RMS normalization, rotary positional embeddings, causal self-attention, SwiGLU MLP), loading pretrained weights from safetensors format with memory-mapped I/O, tokenizing input text with a SentencePiece tokenizer, and performing autoregressive sampling to generate text token by token. The implementation supports multiple model sizes (7B, 13B, 30B, 65B) and precision modes (float32, float16, bfloat16, qint8).

Usage

Execute this workflow when you want to run LLaMA inference in Rust without Python dependencies. It suits embedding LLM text generation into Rust applications, scenarios that need low-level control over the inference loop, and deployment on systems where Python is not available. It requires pretrained weights converted to safetensors format and a SentencePiece tokenizer file.

Execution Steps

Step 1: Convert pretrained weights

Convert the original LLaMA checkpoint (PyTorch .pth format) to safetensors format using the provided Python conversion script. This remaps parameter names to match the tch-rs model's VarStore path hierarchy and stores them in an efficient, memory-mappable format.

Key considerations:

The conversion script maps between PyTorch parameter names and tch-rs path names
Weights are stored in float16 by default to reduce file size
Safetensors format enables zero-copy memory-mapped loading

Step 2: Build the model architecture

Construct the full LLaMA transformer architecture in Rust: token embeddings, N transformer blocks (each containing RMS normalization, causal self-attention with rotary embeddings, and a SwiGLU MLP), a final RMS normalization layer, and a linear language model head. All parameters are registered under a VarStore.
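As a rough illustration of two of the building blocks named above, the sketch below implements RMS normalization and the elementwise gating step of a SwiGLU MLP on plain `f32` slices. It deliberately uses only the Rust standard library rather than tch-rs tensors, and the function names (`rms_norm`, `silu`, `swiglu_gate`) are illustrative assumptions, not names taken from the tch-rs example.

```rust
/// RMS normalization: divide each element by the root-mean-square of the
/// vector, then multiply by a learned per-channel weight.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU (swish) activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Gating step of a SwiGLU MLP: silu(gate) * up, elementwise.
/// `gate` and `up` stand for the outputs of the two input projections;
/// the result would then go through the down projection.
fn swiglu_gate(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len());
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}
```

In a tensor-based implementation the same arithmetic is applied along the last dimension of a batch, with the weight vector registered as a trainable variable.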

What happens:

The model struct hierarchy mirrors the original LLaMA architecture
Each component registers its parameters under named VarStore paths
A configuration struct selects the model size (7B/13B/30B/65B), which in turn determines the layer count, head count, and embedding dimension
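A configuration struct of this kind might look like the sketch below. The struct and field names are assumptions made for illustration; the numeric values are the published LLaMA hyperparameters for each size, not values read from the tch-rs source.

```rust
/// Hyperparameters that vary with model size (hypothetical field names).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Config {
    n_layer: usize, // number of transformer blocks
    n_head: usize,  // attention heads per block
    n_embd: usize,  // embedding (model) dimension
}

impl Config {
    fn config_7b() -> Self {
        Self { n_layer: 32, n_head: 32, n_embd: 4096 }
    }
    fn config_13b() -> Self {
        Self { n_layer: 40, n_head: 40, n_embd: 5120 }
    }
    fn config_30b() -> Self {
        Self { n_layer: 60, n_head: 52, n_embd: 6656 }
    }
    fn config_65b() -> Self {
        Self { n_layer: 80, n_head: 64, n_embd: 8192 }
    }

    /// Per-head dimension, which is the size the rotary embeddings act on.
    fn head_dim(&self) -> usize {
        self.n_embd / self.n_head
    }
}
```

With these values every size has a per-head dimension of 128, so only the number of heads and layers grows with the model.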

Step 3: Load weights with memory-mapped I/O

Load the safetensors weight file using memory-mapped I/O for efficient large-model loading. Each named parameter in the safetensors file is matched to its corresponding VarStore variable and copied in-place, avoiding the need to hold two copies of the full model in memory.
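The name-matching and in-place copy can be sketched with ordinary maps. This is a simplified stand-in, not the tch-rs loader: a `HashMap` plays the role of the memory-mapped file, `Vec<f32>` plays the role of a tensor, and `load_into` is a hypothetical helper name.

```rust
use std::collections::HashMap;

/// Copy each named tensor from `file` into the matching, already allocated
/// variable. Returns an error for a missing name or a size mismatch.
fn load_into(
    vars: &mut HashMap<String, Vec<f32>>,
    file: &HashMap<String, Vec<f32>>,
) -> Result<(), String> {
    for (name, dst) in vars.iter_mut() {
        let src = file
            .get(name)
            .ok_or_else(|| format!("missing tensor: {}", name))?;
        if src.len() != dst.len() {
            return Err(format!(
                "size mismatch for {}: file has {}, variable has {}",
                name,
                src.len(),
                dst.len()
            ));
        }
        // In-place copy: the destination buffer is reused, so no second
        // full-size copy of the model is allocated.
        dst.copy_from_slice(src);
    }
    Ok(())
}
```

With a real memory-mapped file the source slices are views into the mapping, so pages are read from disk only as each tensor is copied.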

Key considerations:

Memory mapping avoids loading the entire file into RAM before copying
VarStore::set_kind sets the initial dtype (float16) before loading
After loading, the dtype can be changed (e.g., to bfloat16 or qint8) for inference
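To see why avoiding a second in-memory copy matters, the following sketch estimates the parameter count and storage size of a LLaMA-style model. The formula assumes the architecture described in Step 2 with untied input and output embeddings; the vocabulary size (32000) and MLP hidden size (11008) used in the usage note are the published 7B values and are assumptions here, since this page does not state them.

```rust
/// Parameter count of a LLaMA-style decoder (no biases).
fn param_count(vocab: u64, n_embd: u64, n_layer: u64, mlp_hidden: u64) -> u64 {
    let embedding = vocab * n_embd;
    let attention = 4 * n_embd * n_embd; // query, key, value, output projections
    let mlp = 3 * n_embd * mlp_hidden; // gate, up, down projections
    let norms = 2 * n_embd; // two RMS norm scales per block
    let per_block = attention + mlp + norms;
    let final_norm = n_embd;
    let lm_head = vocab * n_embd;
    embedding + n_layer * per_block + final_norm + lm_head
}

/// Storage size for a given number of bytes per parameter
/// (2 for float16 or bfloat16, 4 for float32).
fn bytes_for(params: u64, bytes_per_param: u64) -> u64 {
    params * bytes_per_param
}
```

For the 7B shape, `param_count(32000, 4096, 32, 11008)` gives 6,738,415,616 parameters, which is about 13.5 GB in float16. Holding both a fully read file and the model variables in RAM would roughly double that figure.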

Step 4: Precompute rotary position embeddings

Generate the frequency tensor for rotary positional embeddings (RoPE) used in the attention layers. This is a fixed computation that depends only on the context window size and the per-head dimension (embedding dimension divided by head count), so it is precomputed once and reused across all forward passes.

What happens:

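Frequency values are computed as inverse powers of 10000: for channel-pair index i, the frequency is 10000^(-2i/head_dim)
The outer product of position indices and frequencies creates a [seq_len, head_dim/2] matrix of rotation angles
Cosine and sine components of each angle are concatenated so the attention layers can apply the rotation as a complex multiplication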
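The precomputation can be sketched in plain Rust as follows. This is a minimal illustration using nested `Vec`s instead of tch-rs tensors; `precompute_freqs` and `rotate_pair` are hypothetical names, and the cosine and sine tables are returned separately rather than concatenated to keep the example short.

```rust
/// Returns (cos, sin) tables, each shaped [seq_len][head_dim / 2].
fn precompute_freqs(
    head_dim: usize,
    seq_len: usize,
    theta: f32,
) -> (Vec<Vec<f32>>, Vec<Vec<f32>>) {
    let half = head_dim / 2;
    // inv_freq[i] = theta^(-2i / head_dim)
    let inv_freq: Vec<f32> = (0..half)
        .map(|i| 1.0 / theta.powf(2.0 * i as f32 / head_dim as f32))
        .collect();
    let mut cos: Vec<Vec<f32>> = Vec::with_capacity(seq_len);
    let mut sin: Vec<Vec<f32>> = Vec::with_capacity(seq_len);
    for pos in 0..seq_len {
        // One row of the outer product: position times every frequency.
        let angles: Vec<f32> = inv_freq.iter().map(|f| pos as f32 * f).collect();
        cos.push(angles.iter().map(|a| a.cos()).collect());
        sin.push(angles.iter().map(|a| a.sin()).collect());
    }
    (cos, sin)
}

/// Rotate one channel pair (x0, x1) by the angle whose cosine and sine are
/// given. This is multiplication by the complex number cos + i*sin.
fn rotate_pair(x0: f32, x1: f32, cos: f32, sin: f32) -> (f32, f32) {
    (x0 * cos - x1 * sin, x0 * sin + x1 * cos)
}
```

Because the rotation preserves the length of each pair, it encodes position in the angle between query and key vectors without changing their magnitudes.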

Step 5: Tokenize the input prompt

Encode the input text prompt into a sequence of integer token IDs using a SentencePiece tokenizer. The tokenizer vocabulary must match the one used during the original model training.

Key considerations:

The tokenizer reads a JSON-format vocabulary file
Special tokens (BOS, EOS) must be handled according to the model's expectations
The token sequence length is bounded by the model's context window size (512 in the example)
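The special-token and length handling can be sketched as below, taking already encoded ids as input. Several things here are assumptions rather than facts from the tch-rs example: the helper name `prepare_prompt`, the choice to prepend a BOS token, the BOS id of 1 (its value in the original LLaMA vocabulary), and the choice to reject over-long prompts instead of truncating them.

```rust
/// Beginning-of-sequence id (assumed: 1, as in the original LLaMA vocabulary).
const BOS_ID: u32 = 1;

/// Prepend BOS to the encoded prompt and check it fits the context window.
fn prepare_prompt(prompt_ids: &[u32], context_size: usize) -> Result<Vec<u32>, String> {
    let mut ids = Vec::with_capacity(prompt_ids.len() + 1);
    ids.push(BOS_ID);
    ids.extend_from_slice(prompt_ids);
    if ids.len() > context_size {
        return Err(format!(
            "prompt has {} tokens including BOS, context window is {}",
            ids.len(),
            context_size
        ));
    }
    Ok(ids)
}
```

With a context window of 512, a prompt of up to 511 encoded tokens is accepted once the BOS token is counted.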

Step 6: Run autoregressive generation

Generate text token by token in a loop. At each step, the current token sequence (up to the context window size) is fed through the model to produce logits for the next token position. Temperature-scaled softmax converts logits to a probability distribution, from which the next token is sampled via multinomial sampling.

What happens:

The context window slides if the sequence exceeds the maximum context size
Temperature controls randomness: lower values make output more deterministic
Each generated token is appended to the sequence and decoded for display
The no_grad_guard guard disables gradient tracking for as long as it stays in scope, which saves memory during inference
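The sampling loop described above can be sketched without any tensor library. In this illustration the model is a closure that maps a token window to logits, and the random source is a closure returning a uniform value in [0, 1); both, along with all function names, are assumptions made so the example stays self-contained.

```rust
/// Temperature-scaled softmax over raw logits.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum first for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Multinomial draw by inverse CDF; `u` is a uniform sample in [0, 1).
fn sample_index(probs: &[f32], u: f32) -> usize {
    let mut cumulative = 0.0f32;
    for (i, p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    probs.len() - 1 // fallback when rounding leaves the total just below u
}

/// The most recent `context_size` tokens: the sliding window fed to the model.
fn context_window(tokens: &[u32], context_size: usize) -> &[u32] {
    &tokens[tokens.len().saturating_sub(context_size)..]
}

/// Generate `steps` tokens, appending each sampled token to the sequence.
fn generate<F, R>(
    mut tokens: Vec<u32>,
    steps: usize,
    context_size: usize,
    temperature: f32,
    mut forward: F,
    mut uniform: R,
) -> Vec<u32>
where
    F: FnMut(&[u32]) -> Vec<f32>, // returns logits for the next position
    R: FnMut() -> f32,            // returns a uniform sample in [0, 1)
{
    for _ in 0..steps {
        let logits = forward(context_window(&tokens, context_size));
        let probs = softmax_with_temperature(&logits, temperature);
        tokens.push(sample_index(&probs, uniform()) as u32);
    }
    tokens
}
```

Dividing the logits by a temperature below 1 widens the gaps between them before the softmax, which concentrates probability on the highest-scoring token; a temperature above 1 flattens the distribution.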

Execution Diagram

GitHub URL

Workflow Repository