Implementation:FMInference FlexLLMGen AutoTokenizer Usage

Metadata

Field	Value
Sources	FlexLLMGen (https://github.com/FMInference/FlexLLMGen), HuggingFace Transformers documentation (https://huggingface.co/docs/transformers)
Domains	NLP, Text_Processing
Last updated	2026-02-09 00:00 GMT

Overview

Wrapper documentation for HuggingFace AutoTokenizer as configured and used by FlexLLMGen for OPT model inference.

Description

This page documents how FlexLLMGen uses HuggingFace's AutoTokenizer. FlexLLMGen loads it with padding_side="left" and sets add_bos_token = False for OPT decoder-only models. Left padding keeps the last real token of every prompt at the end of its row, which is the position generation continues from. The tokenizer is used to: (1) encode prompts to input_ids with padding="max_length", (2) obtain the stop token ID (e.g., for newline), and (3) decode output_ids back to text with skip_special_tokens=True.
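
The effect of left padding on a batch can be sketched in plain Python. This is only an illustration of the idea, not the Transformers implementation: `pad_left` is a hypothetical helper, the token IDs are made up, and the pad ID of 1 is an assumption chosen for the example.

```python
from typing import List

PAD_ID = 1  # assumed pad token ID, for illustration only


def pad_left(ids: List[int], max_length: int, pad_id: int = PAD_ID) -> List[int]:
    """Prepend pad_id until the row has max_length entries.

    Rows already at or above max_length are returned unchanged (no truncation).
    """
    missing = max(0, max_length - len(ids))
    return [pad_id] * missing + list(ids)


# Made-up token IDs standing in for two encoded prompts of different lengths
batch = [[10, 11, 12], [20, 21]]
padded = [pad_left(row, max_length=5) for row in batch]

# With left padding, the final column holds the last real token of each prompt
last_tokens = [row[-1] for row in padded]
```

Because the padding sits on the left, every row ends on a real prompt token, so a decoder-only model can append new tokens to all rows at the same position.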

External Reference

HuggingFace Tokenizer Documentation

Code Reference

Source: flexllmgen/apps/completion.py, Lines: 45-60 (usage pattern)
FlexLLMGen-specific configuration:

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", padding_side="left")
tokenizer.add_bos_token = False

Import:

from transformers import AutoTokenizer

I/O Contract

from_pretrained() Inputs

Name	Type	Required	Description
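pretrained_model_name_or_path	str	Yes	HuggingFace model name or local path, e.g. "facebook/opt-30b"
padding_side	str	No	Side on which padding is added; "left" for decoder-only models such as OPT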

__call__() Inputs

Name	Type	Required	Description
prompts	List[str]	Yes	Text prompts
padding	str	No	Padding strategy; FlexLLMGen uses "max_length"
max_length	int	No	Sequence length that each prompt is padded to

batch_decode() Inputs

Name	Type	Required	Description
output_ids	np.ndarray	Yes	Token IDs from generation
skip_special_tokens	bool	No	Strip special tokens from the decoded text; FlexLLMGen passes True (the Transformers default is False)

Outputs

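from_pretrained returns a model-specific tokenizer instance (a PreTrainedTokenizer or PreTrainedTokenizerFast subclass), not an AutoTokenizer object
__call__ returns BatchEncoding with input_ids (and attention_mask)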
batch_decode returns List[str]

Usage Examples

from transformers import AutoTokenizer

# Load with FlexLLMGen configuration
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-30b", padding_side="left")
tokenizer.add_bos_token = False

# Tokenize prompts
prompts = ["Question: What is AI?\nAnswer:"]
inputs = tokenizer(prompts, padding="max_length", max_length=128)
# inputs.input_ids: List[List[int]] padded from left

# Get stop token (newline)
stop = tokenizer("\n").input_ids[0]

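# Generate and decode outputs
# `model` is assumed to be an already-initialized FlexLLMGen model (e.g. OptLM); its setup is not shown here
output_ids = model.generate(inputs.input_ids, max_new_tokens=32, stop=stop)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)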
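
To make the role of the stop token ID concrete, the following plain-Python sketch cuts a generated row at the first stop token that appears after the prompt. It is a hedged illustration only: `trim_at_stop` is a hypothetical helper, the token IDs (including 99 as the stop ID) are made up, and it does not claim to reflect how FlexLLMGen's generate handles the stop argument internally.

```python
from typing import List


def trim_at_stop(row: List[int], prompt_len: int, stop_id: int) -> List[int]:
    """Return the generated tokens of `row`, cut before the first stop_id.

    `prompt_len` is the padded prompt length, so row[prompt_len:] is the
    part produced by generation.
    """
    generated = row[prompt_len:]
    if stop_id in generated:
        generated = generated[:generated.index(stop_id)]
    return generated


# Left-padded prompt of length 5, followed by four generated tokens
row = [1, 1, 10, 11, 12, 30, 31, 99, 32]
completion = trim_at_stop(row, prompt_len=5, stop_id=99)
```

Searching only the generated part avoids cutting the row at a stop token that happens to occur inside the prompt itself, such as the newline in the example prompt above.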

Related Pages

Principle:FMInference_FlexLLMGen_Tokenizer_Loading
