Implementation:Neuml Txtai HuggingFace LLM

Knowledge Sources	Neuml_Txtai
Domains	Machine Learning, NLP, LLM, Text Generation, Transformers
Last Updated	2026-02-10 01:00 GMT

Overview

Concrete tool, provided by txtai, for running LLM inference through Hugging Face Transformers pipelines, with model-type auto-detection, vision support, and streaming.

Description

This module provides two main classes: HFGeneration and HFLLM. HFGeneration extends the Generation base class and delegates inference to HFLLM, which wraps a Hugging Face Transformers pipeline. HFLLM auto-detects whether a model is text-generation, sequence-to-sequence, or image-text-to-text (vision), and configures the pipeline accordingly. It supports chat templates, vision models, streaming via TextIteratorStreamer, stop strings, pad token auto-configuration, and batched inference. The module also provides convenience subclasses Generator (causal LM) and Sequences (seq2seq).
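
The task resolution described above can be sketched in plain Python. This is a minimal illustration, not txtai's code: the function names `detect_task` and `resolve_task`, the architecture-suffix heuristic, and the mapping onto Hugging Face pipeline task names are all assumptions made for the example.

```python
# Hypothetical sketch: resolve a Hugging Face pipeline task from a txtai-style task name
# or from a model architecture class name. The suffix heuristic is an assumption and is
# NOT taken from the txtai source.

# txtai-style task name -> Hugging Face pipeline task (assumed mapping)
TASKS = {
    "language-generation": "text-generation",
    "sequence-sequence": "text2text-generation",
    "vision": "image-text-to-text",
}


def detect_task(architecture):
    """Guesses a txtai-style task name from an architecture class name."""
    if architecture is None:
        return None
    if architecture.endswith("ForImageTextToText"):
        return "vision"
    if architecture.endswith("ForCausalLM") or architecture.endswith("LMHeadModel"):
        return "language-generation"
    # Caveat: some vision models also use this suffix, so a real detector needs more signals
    if architecture.endswith("ForConditionalGeneration"):
        return "sequence-sequence"
    return None


def resolve_task(architecture=None, task=None):
    """Returns the pipeline task, preferring an explicit task over detection."""
    name = task or detect_task(architecture)
    # Fall back to text generation when nothing matches
    return TASKS.get(name, "text-generation")


print(resolve_task("GPT2LMHeadModel"))
print(resolve_task("T5ForConditionalGeneration"))
print(resolve_task(task="vision"))
```

An explicit task always wins over detection in this sketch, which mirrors the "auto-detected if not provided" behavior of the task parameter.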

Usage

Use the Hugging Face LLM backend to run Hugging Face Transformers models locally for text generation. It is the default backend when the model path points to a Hugging Face Hub model or a local Transformers model directory.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/pipeline/llm/huggingface.py

Signature

class HFGeneration(Generation):
    def __init__(self, path, template=None, **kwargs)
    def ischat(self)
    def isvision(self)
    def stream(self, texts, maxlength, stream, stop, **kwargs)

class HFLLM(HFPipeline):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, task=None, **kwargs)
    def __call__(self, text, prefix=None, maxlength=512, workers=0, stream=False, stop=None, **kwargs)
    def ischat(self)
    def isvision(self)
    def parameters(self, texts, maxlength, workers, stop, **kwargs)
    def extract(self, result)
    def task(self, path, task, **kwargs)

class Generator(HFLLM):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs)

class Sequences(HFLLM):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs)

class StreamingResponse:
    def __init__(self, pipeline, texts, stop, **kwargs)
    def __call__(self)
    def __iter__(self)

Import

from txtai.pipeline.llm.huggingface import HFGeneration, HFLLM, Generator, Sequences

I/O Contract

Inputs

Name	Type	Required	Description
path	str	No	Model path; accepts a Hugging Face Hub model id or a local path. The task type is auto-detected from the model.
quantize	bool	No	If True, quantizes the model to int8 (CPU only). Defaults to False.
gpu	bool or int	No	True/False to enable GPU, or a specific GPU device id. Defaults to True.
model	Pipeline	No	Optional existing pipeline model to wrap.
task	str	No	Explicit task name (e.g. "language-generation", "sequence-sequence", "vision"). Auto-detected if not provided.
text	str or list	Yes (call)	Input text, list of strings, or list of chat message dicts.
prefix	str	No	Optional prefix prepended to each text element.
maxlength	int	No	Maximum sequence length. Defaults to 512.
workers	int	No	Number of concurrent workers for data processing. Defaults to 0.
stream	bool	No	Stream response token-by-token if True. Defaults to False.
stop	list	No	List of stop strings to halt generation.
kwargs	dict	No	Additional generation keyword arguments.
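
To make the combined semantics of stream and stop concrete, the following stand-alone sketch truncates a stream of text chunks at the first stop string. It is an illustration only: `stream_until_stop` is a hypothetical helper, and this page does not show how HFLLM or Transformers applies stop strings internally.

```python
def stream_until_stop(tokens, stop=None):
    """Yields text chunks until any stop string appears in the accumulated output."""
    stop = [s for s in (stop or []) if s]  # ignore empty stop strings
    text = ""     # everything received so far
    emitted = 0   # number of characters already yielded
    # Hold back enough characters to catch a stop string split across chunks
    holdback = max((len(s) for s in stop), default=1) - 1

    for token in tokens:
        text += token
        # Earliest position at which any stop string occurs, if one does
        hits = [i for i in (text.find(s) for s in stop) if i >= 0]
        if hits:
            cut = min(hits)
            if cut > emitted:
                yield text[emitted:cut]
            return
        safe = len(text) - holdback
        if safe > emitted:
            yield text[emitted:safe]
            emitted = safe

    # Input exhausted without hitting a stop string: flush the held-back tail
    if emitted < len(text):
        yield text[emitted:]


chunks = ["Hello", " wor", "ld ST", "OP ignored"]
print("".join(stream_until_stop(chunks, stop=["STOP"])))
```

Holding back a few characters trades a small amount of latency for never emitting part of a stop string that arrives split across two chunks.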

Outputs

Name	Type	Description
result	str or list or generator	Generated text as a string for single inputs, a list of strings for batch inputs, or a streaming generator when stream=True.
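
Because the return type varies with the input and the stream flag, callers may want to normalize it. The helper below is a small sketch of one way to do that; `as_text_list` is a hypothetical name, and the sketch assumes a streaming result is any iterable of text chunks.

```python
from collections.abc import Iterable


def as_text_list(result):
    """Normalizes a str, a list of str, or a streaming iterable into a list of strings."""
    # Check str first: strings are iterable too, but are a single result here
    if isinstance(result, str):
        return [result]
    if isinstance(result, list):
        return result
    if isinstance(result, Iterable):
        # Streaming result: consume every chunk and join into one string
        return ["".join(result)]
    raise TypeError(f"Unexpected result type: {type(result).__name__}")


print(as_text_list("single output"))
print(as_text_list(["first", "second"]))
print(as_text_list(iter(["stream", "ed ", "chunks"])))
```

Note that normalizing a streaming result consumes it, so this is suited to code paths that do not need token-by-token display.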

Usage Examples

from txtai.pipeline.llm.huggingface import HFGeneration, Generator, Sequences

# Using HFGeneration (auto-detect model type)
llm = HFGeneration("meta-llama/Meta-Llama-3-8B-Instruct", gpu=True)

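# Generate text from a batch containing one chat conversation.
# The result is iterated below; with stream=False each item is a completed response.
results = llm.stream(
    [[{"role": "user", "content": "Explain gravity"}]],
    maxlength=512, stream=False, stop=None
)
for result in results:
    print(result)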

# Using Generator for causal language models
gen = Generator("gpt2", gpu=False)
output = gen("Once upon a time", maxlength=100)

# Using Sequences for sequence-to-sequence models
seq = Sequences("google/flan-t5-small", gpu=False)
output = seq("Translate English to French: Hello world")

# Streaming generation: reuse the Generator created above and print tokens as they arrive
for token in gen("The meaning of life is", stream=True):
    print(token, end="", flush=True)

Related Pages

Environment:Neuml_Txtai_Python_Core_Dependencies
