Implementation:Run llama Llama index Metadata Extractors

From Leeroopedia
Knowledge Sources
Domains: Metadata Extraction, RAG, NLP
Last Updated: 2026-02-11 19:00 GMT

Overview

Implements five concrete LLM-based metadata extractors -- TitleExtractor, KeywordExtractor, QuestionsAnsweredExtractor, SummaryExtractor, and PydanticProgramExtractor -- each enriching document nodes with different types of metadata to improve RAG retrieval quality.

Description

This module provides the primary collection of metadata enrichment tools in LlamaIndex. All five extractors extend BaseExtractor and use LLM-based generation with async parallel processing via run_jobs. Each extractor targets a specific type of metadata:

  • TitleExtractor -- Extracts document-level titles by gathering title candidates from the first N nodes of each document, then combining them into a single document_title field
  • KeywordExtractor -- Generates excerpt_keywords for each individual node using a configurable keyword count
  • QuestionsAnsweredExtractor -- Produces questions_this_excerpt_can_answer metadata, generating questions that are uniquely answerable by each chunk
  • SummaryExtractor -- Creates section_summary and, optionally, prev_section_summary/next_section_summary, so that each node can also carry the summaries of its neighboring nodes
  • PydanticProgramExtractor -- Uses a BasePydanticProgram to extract structured data conforming to a user-defined Pydantic model

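The four LLM-backed extractors (TitleExtractor, KeywordExtractor, QuestionsAnsweredExtractor, and SummaryExtractor) accept a custom LLM instance and custom prompt templates, with sensible defaults provided. They also still accept the deprecated llm_predictor parameter for backward compatibility. PydanticProgramExtractor takes a program rather than an LLM, as its signature shows.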
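The async parallel processing mentioned above can be pictured as a bounded-concurrency job runner. The following is a minimal standard-library sketch of that idea, not the library's run_jobs implementation; run_bounded and fake_llm_call are hypothetical names introduced only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    jobs: List[Callable[[], Awaitable[str]]], workers: int
) -> List[str]:
    """Run jobs concurrently, at most `workers` at a time, keeping input order."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        # The semaphore caps how many jobs are in flight at once.
        async with semaphore:
            return await job()

    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_llm_call(text: str) -> str:
    # Hypothetical stand-in for an LLM request.
    await asyncio.sleep(0)
    return text.upper()


chunks = ["alpha", "beta", "gamma"]
results = asyncio.run(
    run_bounded([lambda c=c: fake_llm_call(c) for c in chunks], workers=2)
)
print(results)
```

In this sketch the workers argument plays the role that num_workers plays in the extractor signatures: it limits concurrency without changing the order of the returned results.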

Usage

Use these extractors in an ingestion pipeline to enrich nodes with metadata before indexing. They improve retrieval accuracy by adding semantic signals (titles, keywords, summaries, questions) that help vector stores and keyword-based retrievers find relevant chunks. Choose the appropriate extractor based on the type of metadata signal most valuable for your use case.

Code Reference

Source Location

  • Repository: Run_llama_Llama_index
  • File: llama-index-core/llama_index/core/extractors/metadata_extractors.py
  • Lines: 1-535

Import

from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    PydanticProgramExtractor,
)

TitleExtractor

Signature

class TitleExtractor(BaseExtractor):
    is_text_node_only: bool = False
    llm: SerializeAsAny[LLM]
    nodes: int = Field(default=5, gt=0)
    node_template: str = Field(default=DEFAULT_TITLE_NODE_TEMPLATE)
    combine_template: str = Field(default=DEFAULT_TITLE_COMBINE_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        nodes: int = 5,
        node_template: str = DEFAULT_TITLE_NODE_TEMPLATE,
        combine_template: str = DEFAULT_TITLE_COMBINE_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name | Type | Required | Description
llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm
nodes | int | No | Number of nodes from the front of each document to use for title extraction (default: 5)
node_template | str | No | Prompt template for extracting per-node title clues
combine_template | str | No | Prompt template for combining node-level clues into a document title
num_workers | int | No | Number of parallel workers (default: DEFAULT_NUM_WORKERS)

Output

Name | Type | Description
document_title | str | Inferred document title stored in each node's metadata

Key Methods

  • aextract(nodes) -- Groups nodes by ref_doc_id, extracts titles per document, returns metadata dicts
  • separate_nodes_by_ref_id(nodes) -- Groups nodes by document reference, keeping at most N nodes per document, where N is the nodes setting
  • extract_titles(nodes_by_doc_id) -- Generates title candidates and combines them using the LLM
  • get_title_candidates(nodes) -- Extracts individual title clues from each node
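The grouping step described by these methods can be sketched in plain Python. This is an illustrative approximation under stated assumptions: FakeNode is a hypothetical stand-in for a LlamaIndex node, and the LLM-based combine step is replaced by a simple string join so the example runs offline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeNode:
    # Hypothetical stand-in for a node; only the fields the sketch needs.
    ref_doc_id: str
    text: str


def group_first_n(nodes: List[FakeNode], n: int) -> Dict[str, List[FakeNode]]:
    """Group nodes by document id, keeping at most n nodes per document."""
    grouped: Dict[str, List[FakeNode]] = defaultdict(list)
    for node in nodes:
        if len(grouped[node.ref_doc_id]) < n:
            grouped[node.ref_doc_id].append(node)
    return dict(grouped)


nodes = [
    FakeNode("doc-a", "Intro"),
    FakeNode("doc-a", "Methods"),
    FakeNode("doc-a", "Results"),
    FakeNode("doc-b", "Preface"),
]

grouped = group_first_n(nodes, n=2)

# Assumption: joining the candidate texts replaces the LLM combine step.
titles = {
    doc_id: " / ".join(node.text for node in group)
    for doc_id, group in grouped.items()
}

# Every node of a document receives that document's title,
# including nodes that were not used as title candidates.
metadata_list = [{"document_title": titles[node.ref_doc_id]} for node in nodes]
print(metadata_list)
```

Only the first n nodes of each document contribute to the title, but the resulting document_title is written back to all of that document's nodes.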

KeywordExtractor

Signature

class KeywordExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    keywords: int = Field(default=5, gt=0)
    prompt_template: str = Field(default=DEFAULT_KEYWORD_EXTRACT_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        keywords: int = 5,
        prompt_template: str = DEFAULT_KEYWORD_EXTRACT_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name | Type | Required | Description
llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm
keywords | int | No | Number of keywords to extract per node (default: 5)
prompt_template | str | No | Prompt template for keyword extraction
num_workers | int | No | Number of parallel workers

Output

Name | Type | Description
excerpt_keywords | str | Comma-separated keywords stored in node metadata
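Because excerpt_keywords is stored as a single comma-separated string, downstream code that needs a list has to split it. The helper below is a hypothetical consumer-side sketch (parse_keywords is not part of the library), and the lowercasing and de-duplication are illustrative choices.

```python
from typing import Dict, List


def parse_keywords(metadata: Dict[str, str]) -> List[str]:
    """Split the excerpt_keywords string into a clean, de-duplicated list."""
    raw = metadata.get("excerpt_keywords", "")
    seen = set()
    keywords: List[str] = []
    for part in raw.split(","):
        keyword = part.strip().lower()
        # Skip empty fragments and repeats while preserving first-seen order.
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords


example = {"excerpt_keywords": "RAG, retrieval , rag,, Metadata"}
print(parse_keywords(example))
```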

QuestionsAnsweredExtractor

Signature

class QuestionsAnsweredExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    questions: int = Field(default=5, gt=0)
    prompt_template: str = Field(default=DEFAULT_QUESTION_GEN_TMPL)
    embedding_only: bool = Field(default=True)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        questions: int = 5,
        prompt_template: str = DEFAULT_QUESTION_GEN_TMPL,
        embedding_only: bool = True,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name | Type | Required | Description
llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm
questions | int | No | Number of questions to generate per node (default: 5)
prompt_template | str | No | Prompt template for question generation
embedding_only | bool | No | Whether to use metadata for embeddings only (default: True)
num_workers | int | No | Number of parallel workers

Output

Name | Type | Description
questions_this_excerpt_can_answer | str | Generated questions stored in node metadata

SummaryExtractor

Signature

class SummaryExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    summaries: List[str]
    prompt_template: str = Field(default=DEFAULT_SUMMARY_EXTRACT_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        summaries: List[str] = ["self"],
        prompt_template: str = DEFAULT_SUMMARY_EXTRACT_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name | Type | Required | Description
llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm
summaries | List[str] | No | List of summary types: "self", "prev", "next" (default: ["self"])
prompt_template | str | No | Prompt template for summary extraction
num_workers | int | No | Number of parallel workers

Output

Name | Type | Description
section_summary | str | Summary of the current node (when "self" in summaries)
prev_section_summary | str | Summary of the previous node (when "prev" in summaries)
next_section_summary | str | Summary of the next node (when "next" in summaries)
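The way the three output keys relate to the summaries setting can be sketched as follows. This is a simplified illustration that assumes the per-node summaries have already been generated; attach_summaries is a hypothetical helper, not a library function.

```python
from typing import Dict, List


def attach_summaries(
    node_summaries: List[str], summaries: List[str]
) -> List[Dict[str, str]]:
    """Build one metadata dict per node from precomputed per-node summaries."""
    metadata_list: List[Dict[str, str]] = [{} for _ in node_summaries]
    last = len(node_summaries) - 1
    for i, metadata in enumerate(metadata_list):
        # The first node has no predecessor, the last node has no successor.
        if "prev" in summaries and i > 0 and node_summaries[i - 1]:
            metadata["prev_section_summary"] = node_summaries[i - 1]
        if "next" in summaries and i < last and node_summaries[i + 1]:
            metadata["next_section_summary"] = node_summaries[i + 1]
        if "self" in summaries and node_summaries[i]:
            metadata["section_summary"] = node_summaries[i]
    return metadata_list


result = attach_summaries(["A", "B", "C"], ["self", "prev", "next"])
print(result)
```

Each node is summarized once, and the neighbor keys reuse those summaries, so requesting "prev" and "next" in this sketch adds metadata without adding further summarization work.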

PydanticProgramExtractor

Signature

class PydanticProgramExtractor(BaseExtractor, Generic[Model]):
    program: SerializeAsAny[BasePydanticProgram[Model]]
    input_key: str = Field(default="input")
    extract_template_str: str = Field(default=DEFAULT_EXTRACT_TEMPLATE_STR)

Inputs

Name | Type | Required | Description
program | BasePydanticProgram[Model] | Yes | Pydantic program that defines the extraction schema and LLM interaction
input_key | str | No | Key used as input to the program template (default: "input")
extract_template_str | str | No | Template string for extraction context formatting

Output

Name | Type | Description
(model fields) | Dict[str, Any] | Dictionary of all fields from the extracted Pydantic model via model_dump()
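The flow from node text to a metadata dictionary can be sketched with standard-library stand-ins. Everything here is an assumption made for illustration: EXTRACT_TEMPLATE only approximates the role of DEFAULT_EXTRACT_TEMPLATE_STR, the dataclass stands in for a Pydantic model (with asdict playing the role of model_dump()), and fake_program replaces a real BasePydanticProgram.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict

# Assumed wording; the real default lives in DEFAULT_EXTRACT_TEMPLATE_STR.
EXTRACT_TEMPLATE = (
    "Here is the content of the section:\n"
    "{context_str}\n"
    "Extract a {class_name} object."
)


@dataclass
class EntityInfo:
    # Stand-in for a user-defined Pydantic model.
    name: str
    category: str


def fake_program(**kwargs: str) -> EntityInfo:
    # Hypothetical program: a real one would send kwargs["input"] to an LLM.
    prompt = kwargs["input"]
    name = "LlamaIndex" if "LlamaIndex" in prompt else "unknown"
    return EntityInfo(name=name, category="library")


def extract_metadata(
    text: str, program: Callable[..., EntityInfo], input_key: str = "input"
) -> Dict[str, Any]:
    """Format the node text, call the program under input_key, dump the fields."""
    extract_str = EXTRACT_TEMPLATE.format(
        context_str=text, class_name=EntityInfo.__name__
    )
    result = program(**{input_key: extract_str})
    return asdict(result)  # plays the role of model_dump()


metadata = extract_metadata("LlamaIndex ships metadata extractors.", fake_program)
print(metadata)
```

The point of the sketch is that each field of the model becomes a top-level metadata key, so field names should be chosen with the node's existing metadata keys in mind.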

Default Prompt Templates

Constant | Used By | Purpose
DEFAULT_TITLE_NODE_TEMPLATE | TitleExtractor | Extracts title clues from individual nodes
DEFAULT_TITLE_COMBINE_TEMPLATE | TitleExtractor | Combines candidate titles into a final document title
DEFAULT_KEYWORD_EXTRACT_TEMPLATE | KeywordExtractor | Generates comma-separated keywords from node content
DEFAULT_QUESTION_GEN_TMPL | QuestionsAnsweredExtractor | Generates questions uniquely answerable by the node
DEFAULT_SUMMARY_EXTRACT_TEMPLATE | SummaryExtractor | Summarizes key topics and entities of a section
DEFAULT_EXTRACT_TEMPLATE_STR | PydanticProgramExtractor | Formats section content for structured extraction

Helper Entities

add_class_name Function

def add_class_name(value: Any, handler: Callable, info: Any) -> Dict[str, Any]

A serialization helper that adds a class_name field to serialized output if the value has a class_name() method.
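The behavior described above can be sketched without Pydantic. The version below is a simplified approximation: it drops the info argument, and the handler and classes (default_handler, FakeLLM, Plain) are hypothetical names used only to exercise both branches.

```python
from typing import Any, Callable, Dict


def add_class_name_sketch(
    value: Any, handler: Callable[[Any], Dict[str, Any]]
) -> Dict[str, Any]:
    """Serialize with the wrapped handler, then tag the result with the class name."""
    data = handler(value)
    if hasattr(value, "class_name"):
        data["class_name"] = value.class_name()
    return data


class FakeLLM:
    # Hypothetical object that exposes class_name().
    model = "demo"

    @classmethod
    def class_name(cls) -> str:
        return "FakeLLM"


class Plain:
    # Hypothetical object without class_name().
    model = "demo"


def default_handler(value: Any) -> Dict[str, Any]:
    # Stand-in for the default serializer passed in by the framework.
    return {"model": value.model}


tagged = add_class_name_sketch(FakeLLM(), default_handler)
untagged = add_class_name_sketch(Plain(), default_handler)
print(tagged, untagged)
```

Recording the class name in the serialized output gives a deserializer enough information to decide which concrete class to rebuild.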

DEFAULT_ENTITY_MAP

A dictionary mapping entity type codes (e.g., "PER", "ORG", "LOC") to human-readable category names. It covers 15 entity types, including persons, organizations, locations, animals, diseases, and events. The associated default NER model is tomaarsen/span-marker-mbert-base-multinerd.

Usage Examples

Basic Usage

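from llama_index.core import Document
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Placeholder input; replace with your own loaded documents.
documents = [Document(text="LlamaIndex enriches nodes with metadata before indexing.")]

# Each extractor falls back to Settings.llm when no llm is passed.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),  # split documents into nodes first
        TitleExtractor(nodes=5),
        KeywordExtractor(keywords=10),
        QuestionsAnsweredExtractor(questions=3),
        SummaryExtractor(summaries=["self", "prev", "next"]),
    ]
)

# Returns the nodes with the extracted metadata attached.
nodes = pipeline.run(documents=documents)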

PydanticProgramExtractor Usage

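import asyncio

from llama_index.core.extractors import PydanticProgramExtractor
from llama_index.core.program import LLMTextCompletionProgram
from pydantic import BaseModel

class EntityInfo(BaseModel):
    name: str
    description: str
    category: str

# Any BasePydanticProgram instance works here; LLMTextCompletionProgram is
# one option and uses Settings.llm when no llm is passed.
my_pydantic_program = LLMTextCompletionProgram.from_defaults(
    output_cls=EntityInfo,
    prompt_template_str="{input}",
)

extractor = PydanticProgramExtractor(
    program=my_pydantic_program,
    input_key="input",
)

# aextract is a coroutine, so it must run inside an event loop.
# `nodes` is assumed to come from an earlier parsing or ingestion step.
metadata_list = asyncio.run(extractor.aextract(nodes))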
