Implementation:Neuml Txtai HuggingFace LLM
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, NLP, LLM, Text Generation, Transformers |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete tool, provided by txtai, for running LLM inference through Hugging Face Transformers pipelines, with automatic model-type detection, vision support, and streaming.
Description
This module provides two main classes: HFGeneration and HFLLM. HFGeneration extends the Generation base class and delegates inference to HFLLM, which wraps a Hugging Face Transformers pipeline. HFLLM auto-detects whether a model is text-generation, sequence-to-sequence, or image-text-to-text (vision), and configures the pipeline accordingly. It supports chat templates, vision models, stop strings, pad token auto-configuration, and batched inference. Streaming is handled by the StreamingResponse class, which is built on TextIteratorStreamer. The module also provides two convenience subclasses of HFLLM: Generator (causal LM) and Sequences (seq2seq).
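The task resolution described above can be sketched in plain Python. This is a minimal illustration, not the module's code: `TASK_ALIASES` and `resolve_task` are hypothetical names, the alias table assumes the txtai task names from the Inputs table map onto the Transformers pipeline tasks `text-generation`, `text2text-generation`, and `image-text-to-text`, and the fallback to `text-generation` is likewise an assumption.

```python
# Assumed alias table: txtai task name -> Transformers pipeline task.
# Both the table and the fallback below are assumptions for illustration.
TASK_ALIASES = {
    "language-generation": "text-generation",
    "sequence-sequence": "text2text-generation",
    "vision": "image-text-to-text",
}


def resolve_task(task=None, detected=None):
    """
    Hypothetical helper: prefer an explicit task, then an auto-detected one,
    and fall back to plain text generation when neither is known.
    """
    name = task or detected
    return TASK_ALIASES.get(name, "text-generation")


print(resolve_task("sequence-sequence"))      # text2text-generation
print(resolve_task(None, detected="vision"))  # image-text-to-text
print(resolve_task())                         # text-generation
```

In this sketch an explicit `task` always wins over detection, which matches the Inputs table's statement that the task is auto-detected only if not provided.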
Usage
Use the HuggingFace LLM backend when you want to run local Hugging Face Transformers models for text generation. This is the default backend when a model path points to a Hugging Face Hub model or a local Transformers model directory.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/llm/huggingface.py
Signature
class HFGeneration(Generation):
    def __init__(self, path, template=None, **kwargs)
    def ischat(self)
    def isvision(self)
    def stream(self, texts, maxlength, stream, stop, **kwargs)

class HFLLM(HFPipeline):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, task=None, **kwargs)
    def __call__(self, text, prefix=None, maxlength=512, workers=0, stream=False, stop=None, **kwargs)
    def ischat(self)
    def isvision(self)
    def parameters(self, texts, maxlength, workers, stop, **kwargs)
    def extract(self, result)
    def task(self, path, task, **kwargs)

class Generator(HFLLM):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs)

class Sequences(HFLLM):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs)

class StreamingResponse:
    def __init__(self, pipeline, texts, stop, **kwargs)
    def __call__(self)
    def __iter__(self)
Import
from txtai.pipeline.llm.huggingface import HFGeneration, HFLLM, Generator, Sequences
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path; accepts Hugging Face model hub id or local path. Auto-detects task type. |
| quantize | bool | No | If True, quantizes the model to int8 (CPU only). Defaults to False. |
| gpu | bool or int | No | True/False to enable GPU, or a specific GPU device id. Defaults to True. |
| model | Pipeline | No | Optional existing pipeline model to wrap. |
| task | str | No | Explicit task name (e.g. "language-generation", "sequence-sequence", "vision"). Auto-detected if not provided. |
| text | str or list | Yes (call) | Input text, list of strings, or list of chat message dicts. |
| prefix | str | No | Optional prefix prepended to each text element. |
| maxlength | int | No | Maximum sequence length. Defaults to 512. |
| workers | int | No | Number of concurrent workers for data processing. Defaults to 0. |
| stream | bool | No | Stream response token-by-token if True. Defaults to False. |
| stop | list | No | List of stop strings to halt generation. |
| kwargs | dict | No | Additional generation keyword arguments. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | str or list or generator | Generated text as a string for single inputs, a list of strings for batch inputs, or a streaming generator when stream=True. |
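The input and output shapes in the tables above can be illustrated with a small stand-in. This is a sketch under stated assumptions: `call` and `fake_generate` are hypothetical names, the real pipeline is replaced by a function that upper-cases its prompt, and only plain strings are handled (chat message dicts and streaming are left out).

```python
def fake_generate(prompt):
    """Hypothetical stand-in for a model: returns the prompt upper-cased."""
    return prompt.upper()


def call(generate, text, prefix=None):
    """
    Hypothetical wrapper mirroring the documented contract:
    a single string returns a string, a list returns a list.
    """
    # Normalize the input to a list of prompts
    texts = text if isinstance(text, list) else [text]

    # Prepend the optional prefix to every element
    if prefix:
        texts = [f"{prefix}{x}" for x in texts]

    results = [generate(x) for x in texts]

    # Unwrap the result when the caller passed a single string
    return results[0] if isinstance(text, str) else results


print(call(fake_generate, "hello"))                # HELLO
print(call(fake_generate, ["a", "b"]))             # ['A', 'B']
print(call(fake_generate, "hello", prefix="q: "))  # Q: HELLO
```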
Usage Examples
from txtai.pipeline.llm.huggingface import HFGeneration, Generator, Sequences

# Using HFGeneration (auto-detect model type)
llm = HFGeneration("meta-llama/Meta-Llama-3-8B-Instruct", gpu=True)

# Generate text
results = llm.stream(
    [[{"role": "user", "content": "Explain gravity"}]],
    maxlength=512, stream=False, stop=None
)
for result in results:
    print(result)

# Using Generator for causal language models
gen = Generator("gpt2", gpu=False)
output = gen("Once upon a time", maxlength=100)

# Using Sequences for sequence-to-sequence models
seq = Sequences("google/flan-t5-small", gpu=False)
output = seq("Translate English to French: Hello world")

# Streaming generation
gen = Generator("gpt2", gpu=False)
for token in gen("The meaning of life is", stream=True):
    print(token, end="")
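For readers curious how a streaming response of this shape can work, the following standard-library sketch shows the general producer thread plus queue pattern. It is not the StreamingResponse implementation: `TokenStream` and `fake_tokens` are hypothetical names, the model is replaced by a generator that splits text on whitespace, and the 5 second queue timeout is an arbitrary choice for the example.

```python
from queue import Queue
from threading import Thread

# Sentinel marking the end of the stream
_DONE = object()


class TokenStream:
    """Hypothetical stand-in for a streaming response (not the txtai class)."""

    def __init__(self, producer, *args):
        self.queue = Queue()
        self.thread = Thread(target=self._run, args=(producer, *args))

    def _run(self, producer, *args):
        # Runs in the background thread: push tokens as they are produced
        try:
            for token in producer(*args):
                self.queue.put(token)
        finally:
            # Always signal completion, even if the producer fails
            self.queue.put(_DONE)

    def __call__(self):
        # Start generation and return self so the caller can iterate
        self.thread.start()
        return self

    def __iter__(self):
        while True:
            token = self.queue.get(timeout=5)
            if token is _DONE:
                break
            yield token
        self.thread.join()


def fake_tokens(text):
    """Hypothetical token source: yields whitespace-separated words."""
    for word in text.split():
        yield word + " "


for token in TokenStream(fake_tokens, "The meaning of life is")():
    print(token, end="")
print()
```

Running generation in a background thread lets the caller consume tokens as they arrive instead of waiting for the full output, which is the behavior the `stream=True` example relies on.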