Implementation:Neuml Txtai HuggingFace LLM

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, NLP, LLM, Text Generation, Transformers
Last Updated: 2026-02-10 01:00 GMT

Overview

A concrete txtai tool for running LLM inference through Hugging Face Transformers pipelines, with automatic model type detection, vision support, and streaming.

Description

This module provides two main classes: HFGeneration and HFLLM. HFGeneration extends the Generation base class and delegates inference to HFLLM, which wraps a Hugging Face Transformers pipeline. HFLLM auto-detects whether a model is text-generation, sequence-to-sequence, or image-text-to-text (vision), and configures the pipeline accordingly. It supports chat templates, vision models, streaming via TextIteratorStreamer, stop strings, pad token auto-configuration, and batched inference. The module also provides convenience subclasses Generator (causal LM) and Sequences (seq2seq).
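The auto-detection step described above can be pictured as a small lookup. The following is a minimal sketch, not txtai's actual code: `resolve_task`, `TASK_ALIASES`, and the `vision_architectures` parameter are hypothetical names, and the architecture-name heuristics, the Hugging Face pipeline task names, and the fallback value are assumptions made for illustration. Only the three alias strings come from the `task` row of the inputs table.

```python
# Hypothetical sketch, NOT txtai's implementation. The alias keys come from the
# "task" input described on this page; the mapped pipeline task names and the
# architecture-name checks below are assumptions.
TASK_ALIASES = {
    "language-generation": "text-generation",
    "sequence-sequence": "text2text-generation",
    "vision": "image-text-to-text",
}


def resolve_task(architecture=None, task=None, vision_architectures=()):
    """Return a pipeline task name from an explicit alias or an architecture name."""
    # An explicit task always wins; otherwise guess from the architecture name
    if task is None and architecture:
        if architecture in vision_architectures:
            task = "vision"
        elif "CausalLM" in architecture or "LMHead" in architecture:
            task = "language-generation"
        elif "ConditionalGeneration" in architecture:
            task = "sequence-sequence"

    # Falling back to seq2seq when nothing matches is also an assumption
    return TASK_ALIASES.get(task, "text2text-generation")


print(resolve_task("GPT2LMHeadModel"))             # text-generation
print(resolve_task("T5ForConditionalGeneration"))  # text2text-generation
print(resolve_task(task="vision"))                 # image-text-to-text
```

Checking the vision set before the generic `ConditionalGeneration` test matters in this sketch, because many vision-language architecture names also end in `ForConditionalGeneration`.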

Usage

Use the HuggingFace LLM backend when you want to run local Hugging Face Transformers models for text generation. This is the default backend when a model path points to a Hugging Face Hub model or a local Transformers model directory.

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/pipeline/llm/huggingface.py

Signature

class HFGeneration(Generation):
    def __init__(self, path, template=None, **kwargs)
    def ischat(self)
    def isvision(self)
    def stream(self, texts, maxlength, stream, stop, **kwargs)

class HFLLM(HFPipeline):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, task=None, **kwargs)
    def __call__(self, text, prefix=None, maxlength=512, workers=0, stream=False, stop=None, **kwargs)
    def ischat(self)
    def isvision(self)
    def parameters(self, texts, maxlength, workers, stop, **kwargs)
    def extract(self, result)
    def task(self, path, task, **kwargs)

class Generator(HFLLM):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs)

class Sequences(HFLLM):
    def __init__(self, path=None, quantize=False, gpu=True, model=None, **kwargs)

class StreamingResponse:
    def __init__(self, pipeline, texts, stop, **kwargs)
    def __call__(self)
    def __iter__(self)

Import

from txtai.pipeline.llm.huggingface import HFGeneration, HFLLM, Generator, Sequences

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
path | str | No | Model path; accepts Hugging Face model hub id or local path. Auto-detects task type.
quantize | bool | No | If True, quantizes the model to int8 (CPU only). Defaults to False.
gpu | bool or int | No | True/False to enable GPU, or a specific GPU device id. Defaults to True.
model | Pipeline | No | Optional existing pipeline model to wrap.
task | str | No | Explicit task name (e.g. "language-generation", "sequence-sequence", "vision"). Auto-detected if not provided.
text | str or list | Yes (call) | Input text, list of strings, or list of chat message dicts.
prefix | str | No | Optional prefix prepended to each text element.
maxlength | int | No | Maximum sequence length. Defaults to 512.
workers | int | No | Number of concurrent workers for data processing. Defaults to 0.
stream | bool | No | Stream response token-by-token if True. Defaults to False.
stop | list | No | List of stop strings to halt generation.
kwargs | dict | No | Additional generation keyword arguments.

Outputs

Name | Type | Description
--- | --- | ---
result | str or list or generator | Generated text as a string for single inputs, a list of strings for batch inputs, or a streaming generator when stream=True.
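The input and output shapes in this contract can be illustrated without loading a model. The sketch below uses a stand-in "model" (any function from string to string) and a hypothetical `run` helper; it only mirrors the documented behavior for plain string inputs (prefix prepended to each element, string in gives string out, list in gives list out, generator when streaming) and makes no claim about how HFLLM implements it. Chat message inputs are deliberately left out.

```python
# Hypothetical stand-in for the call contract; `run` is not a txtai function.
def run(model, text, prefix=None, stream=False):
    """Apply `model` (a str -> str callable) following the documented I/O shapes."""
    # A single string is treated as a one-element batch
    texts = [text] if isinstance(text, str) else list(text)

    # The optional prefix is prepended to every element
    if prefix:
        texts = [f"{prefix}{t}" for t in texts]

    if stream:
        # Streaming returns a generator of chunks (one character per chunk here)
        return (chunk for t in texts for chunk in model(t))

    results = [model(t) for t in texts]

    # String input returns a string, list input returns a list
    return results[0] if isinstance(text, str) else results


print(run(str.upper, "hello"))                  # HELLO
print(run(str.upper, ["a", "b"]))               # ['A', 'B']
print(run(str.upper, "hi", prefix="say: "))     # SAY: HI
print("".join(run(str.upper, "hello", stream=True)))  # HELLO
```

Callers that accept both single and batch inputs may want to check the result type before iterating, since iterating over a returned string yields characters rather than results.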

Usage Examples

from txtai.pipeline.llm.huggingface import HFGeneration, Generator, Sequences

# Using HFGeneration (auto-detect model type)
llm = HFGeneration("meta-llama/Meta-Llama-3-8B-Instruct", gpu=True)

# Generate text
results = llm.stream(
    [[{"role": "user", "content": "Explain gravity"}]],
    maxlength=512, stream=False, stop=None
)
for result in results:
    print(result)

# Using Generator for causal language models
gen = Generator("gpt2", gpu=False)
output = gen("Once upon a time", maxlength=100)

# Using Sequences for sequence-to-sequence models
seq = Sequences("google/flan-t5-small", gpu=False)
output = seq("Translate English to French: Hello world")

# Streaming generation
gen = Generator("gpt2", gpu=False)
for token in gen("The meaning of life is", stream=True):
    print(token, end="")
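The streaming example above prints chunks as they arrive. When the full text is also needed, or when output should be cut at a marker on the client side, the stream can be wrapped in a small helper. This is a hedged sketch: `take_until_stop` is a hypothetical helper that works on any iterable of string chunks, and it is not how the `stop` argument is implemented inside the pipeline.

```python
# Hypothetical client-side helper; independent of txtai and Transformers.
def take_until_stop(chunks, stop=None):
    """Yield chunks until the accumulated text contains one of the stop strings."""
    stop = stop or []
    text = ""
    for chunk in chunks:
        text += chunk
        hits = [text.find(s) for s in stop if s in text]
        if hits:
            # Emit only the part of this chunk that precedes the earliest stop string
            cut = min(hits)
            emitted = len(text) - len(chunk)
            if cut > emitted:
                yield text[emitted:cut]
            return
        yield chunk


# Simulated stream; a real one could be the generator returned with stream=True
stream = iter(["Hello", " wor", "ld. END", " ignored"])
print("".join(take_until_stop(stream, stop=["END"])))  # Hello world. 
```

One limitation of this sketch: if a stop string spans a chunk boundary, the part that arrived in earlier chunks has already been yielded and cannot be retracted.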
