Implementation:Run llama Llama index Metadata Extractors

Knowledge Sources	Run_llama_Llama_index
Domains	Metadata Extraction, RAG, NLP
Last Updated	2026-02-11 19:00 GMT

Overview

Implements five concrete LLM-based metadata extractors -- TitleExtractor, KeywordExtractor, QuestionsAnsweredExtractor, SummaryExtractor, and PydanticProgramExtractor -- each enriching document nodes with different types of metadata to improve RAG retrieval quality.

Description

This module provides the primary collection of metadata enrichment tools in LlamaIndex. All five extractors extend BaseExtractor and use LLM-based generation with async parallel processing via run_jobs. Each extractor targets a specific type of metadata:

TitleExtractor -- Extracts document-level titles by gathering title candidates from the first N nodes of each document, then combining them into a single document_title field
KeywordExtractor -- Generates excerpt_keywords for each individual node using a configurable keyword count
QuestionsAnsweredExtractor -- Produces questions_this_excerpt_can_answer metadata, generating questions that are uniquely answerable by each chunk
SummaryExtractor -- Creates section_summary and optionally prev_section_summary/next_section_summary for adjacent node context sharing
PydanticProgramExtractor -- Uses a BasePydanticProgram to extract structured data conforming to a user-defined Pydantic model

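The four prompt-based extractors (TitleExtractor, KeywordExtractor, QuestionsAnsweredExtractor, and SummaryExtractor) accept a custom LLM instance and custom prompt templates; when these are omitted they fall back to Settings.llm and the default templates. They also accept the deprecated llm_predictor parameter for backward compatibility. PydanticProgramExtractor is configured differently: it takes a BasePydanticProgram instead of an LLM, along with an extraction template string.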
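The effect of num_workers on parallel LLM calls can be pictured with a small bounded-concurrency helper. This is a minimal sketch using only asyncio, not the LlamaIndex run_jobs implementation; run_bounded and fake_llm_call are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    num_workers: int = 4,
) -> List[R]:
    """Apply `worker` to every item with at most `num_workers` calls in flight."""
    semaphore = asyncio.Semaphore(num_workers)

    async def guarded(item: T) -> R:
        # The semaphore blocks here once `num_workers` calls are running.
        async with semaphore:
            return await worker(item)

    # gather preserves input order, so results line up with `items`.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_llm_call(text: str) -> str:
    """Hypothetical stand-in for one LLM request (assumption, not a real API)."""
    await asyncio.sleep(0.01)
    return text.upper()


if __name__ == "__main__":
    results = asyncio.run(run_bounded(["a", "b", "c"], fake_llm_call, num_workers=2))
    print(results)
```

Because results come back in input order, each one can be paired with the node that produced it, which is what a per-node metadata list requires.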

Usage

Use these extractors in an ingestion pipeline to enrich nodes with metadata before indexing. They improve retrieval accuracy by adding semantic signals (titles, keywords, summaries, questions) that help vector stores and keyword-based retrievers find relevant chunks. Choose the appropriate extractor based on the type of metadata signal most valuable for your use case.

Code Reference

Source Location

Repository: Run_llama_Llama_index
File: llama-index-core/llama_index/core/extractors/metadata_extractors.py
Lines: 1-535

Import

from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    PydanticProgramExtractor,
)

TitleExtractor

Signature

class TitleExtractor(BaseExtractor):
    is_text_node_only: bool = False
    llm: SerializeAsAny[LLM]
    nodes: int = Field(default=5, gt=0)
    node_template: str = Field(default=DEFAULT_TITLE_NODE_TEMPLATE)
    combine_template: str = Field(default=DEFAULT_TITLE_COMBINE_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        nodes: int = 5,
        node_template: str = DEFAULT_TITLE_NODE_TEMPLATE,
        combine_template: str = DEFAULT_TITLE_COMBINE_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name	Type	Required	Description
llm	Optional[LLM]	No	Language model for generation; defaults to `Settings.llm`
nodes	int	No	Number of nodes from the front of each document to use for title extraction (default: 5)
node_template	str	No	Prompt template for extracting per-node title clues
combine_template	str	No	Prompt template for combining node-level clues into a document title
num_workers	int	No	Number of parallel workers (default: DEFAULT_NUM_WORKERS)

Output

Name	Type	Description
document_title	str	Inferred document title stored in each node's metadata

Key Methods

aextract(nodes) -- Groups nodes by ref_doc_id, extracts titles per document, returns metadata dicts
separate_nodes_by_ref_id(nodes) -- Groups nodes by document reference, taking at most N nodes per document
extract_titles(nodes_by_doc_id) -- Generates title candidates and combines them using the LLM
get_title_candidates(nodes) -- Extracts individual title clues from each node
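The grouping and title propagation described by these methods can be sketched without LlamaIndex. This is an illustration under stated assumptions, not the library's code: Node, propose_title, and combine_titles are hypothetical stand-ins, with the last two taking the place of the LLM calls driven by node_template and combine_template.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical stand-in for a node; only the fields used here."""
    ref_doc_id: Optional[str]
    text: str


def separate_nodes_by_ref_id(
    nodes: Sequence[Node], max_nodes: int
) -> Dict[Optional[str], List[Node]]:
    """Group nodes by source document, keeping at most `max_nodes` per document."""
    grouped: Dict[Optional[str], List[Node]] = {}
    for node in nodes:
        bucket = grouped.setdefault(node.ref_doc_id, [])
        if len(bucket) < max_nodes:
            bucket.append(node)
    return grouped


def extract_document_titles(
    nodes: Sequence[Node],
    max_nodes: int,
    propose_title: Callable[[str], str],
    combine_titles: Callable[[List[str]], str],
) -> List[Dict[str, str]]:
    """Return one metadata dict per input node, sharing a title per document."""
    grouped = separate_nodes_by_ref_id(nodes, max_nodes)
    titles = {
        doc_id: combine_titles([propose_title(n.text) for n in doc_nodes])
        for doc_id, doc_nodes in grouped.items()
    }
    # Every node receives its document's title, including nodes beyond the first N.
    return [{"document_title": titles[n.ref_doc_id]} for n in nodes]
```

Only the first N nodes of a document contribute candidates, so the number of title prompts per document stays bounded regardless of document length.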

KeywordExtractor

Signature

class KeywordExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    keywords: int = Field(default=5, gt=0)
    prompt_template: str = Field(default=DEFAULT_KEYWORD_EXTRACT_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        keywords: int = 5,
        prompt_template: str = DEFAULT_KEYWORD_EXTRACT_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name	Type	Required	Description
llm	Optional[LLM]	No	Language model for generation; defaults to `Settings.llm`
keywords	int	No	Number of keywords to extract per node (default: 5)
prompt_template	str	No	Prompt template for keyword extraction
num_workers	int	No	Number of parallel workers

Output

Name	Type	Description
excerpt_keywords	str	Comma-separated keywords stored in node metadata
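Because excerpt_keywords is a single comma-separated string, downstream code that wants a list (for example, for metadata filtering) has to split it. The helper below is a hypothetical post-processing sketch and not part of LlamaIndex; lowercasing and de-duplication are assumptions about what a consumer might want.

```python
from typing import List


def split_keywords(excerpt_keywords: str) -> List[str]:
    """Split a comma-separated keyword string into a clean, de-duplicated list."""
    seen = set()
    keywords: List[str] = []
    for raw in excerpt_keywords.split(","):
        keyword = raw.strip().lower()
        # Skip empty fragments (e.g. from ",,") and repeated keywords.
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords
```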

QuestionsAnsweredExtractor

Signature

class QuestionsAnsweredExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    questions: int = Field(default=5, gt=0)
    prompt_template: str = Field(default=DEFAULT_QUESTION_GEN_TMPL)
    embedding_only: bool = Field(default=True)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        questions: int = 5,
        prompt_template: str = DEFAULT_QUESTION_GEN_TMPL,
        embedding_only: bool = True,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name	Type	Required	Description
llm	Optional[LLM]	No	Language model for generation; defaults to `Settings.llm`
questions	int	No	Number of questions to generate per node (default: 5)
prompt_template	str	No	Prompt template for question generation
embedding_only	bool	No	Whether to use metadata for embeddings only (default: True)
num_workers	int	No	Number of parallel workers

Output

Name	Type	Description
questions_this_excerpt_can_answer	str	Generated questions stored in node metadata

SummaryExtractor

Signature

class SummaryExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    summaries: List[str]
    prompt_template: str = Field(default=DEFAULT_SUMMARY_EXTRACT_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        summaries: List[str] = ["self"],
        prompt_template: str = DEFAULT_SUMMARY_EXTRACT_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None

Inputs

Name	Type	Required	Description
llm	Optional[LLM]	No	Language model for generation; defaults to `Settings.llm`
summaries	List[str]	No	List of summary types: `"self"`, `"prev"`, `"next"` (default: `["self"]`)
prompt_template	str	No	Prompt template for summary extraction
num_workers	int	No	Number of parallel workers

Output

Name	Type	Description
section_summary	str	Summary of the current node (when `"self"` in summaries)
prev_section_summary	str	Summary of the previous node (when `"prev"` in summaries)
next_section_summary	str	Summary of the next node (when `"next"` in summaries)
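How the three output keys relate to neighboring nodes can be shown with a short sketch. It assumes one summary string has already been generated per node, in document order; assign_summaries is a hypothetical helper written for illustration rather than the library's own method.

```python
from typing import Dict, List, Sequence


def assign_summaries(
    node_summaries: Sequence[str],
    summaries: Sequence[str] = ("self",),
) -> List[Dict[str, str]]:
    """Build per-node metadata from an ordered list of per-node summaries."""
    metadata_list: List[Dict[str, str]] = [{} for _ in node_summaries]
    last = len(node_summaries) - 1
    for i, metadata in enumerate(metadata_list):
        # The first node has no predecessor; the last node has no successor.
        if "prev" in summaries and i > 0 and node_summaries[i - 1]:
            metadata["prev_section_summary"] = node_summaries[i - 1]
        if "next" in summaries and i < last and node_summaries[i + 1]:
            metadata["next_section_summary"] = node_summaries[i + 1]
        if "self" in summaries and node_summaries[i]:
            metadata["section_summary"] = node_summaries[i]
    return metadata_list
```

Each node is summarized once and the result is reused for its neighbors, so requesting "prev" and "next" adds context without additional summary generations in this sketch.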

PydanticProgramExtractor

Signature

class PydanticProgramExtractor(BaseExtractor, Generic[Model]):
    program: SerializeAsAny[BasePydanticProgram[Model]]
    input_key: str = Field(default="input")
    extract_template_str: str = Field(default=DEFAULT_EXTRACT_TEMPLATE_STR)

Inputs

Name	Type	Required	Description
program	BasePydanticProgram[Model]	Yes	Pydantic program that defines the extraction schema and LLM interaction
input_key	str	No	Key used as input to the program template (default: `"input"`)
extract_template_str	str	No	Template string for extraction context formatting

Output

Name	Type	Description
(model fields)	Dict[str, Any]	Dictionary of all fields from the extracted Pydantic model via `model_dump()`

Default Prompt Templates

Constant	Used By	Purpose
`DEFAULT_TITLE_NODE_TEMPLATE`	TitleExtractor	Extracts title clues from individual nodes
`DEFAULT_TITLE_COMBINE_TEMPLATE`	TitleExtractor	Combines candidate titles into a final document title
`DEFAULT_KEYWORD_EXTRACT_TEMPLATE`	KeywordExtractor	Generates comma-separated keywords from node content
`DEFAULT_QUESTION_GEN_TMPL`	QuestionsAnsweredExtractor	Generates questions uniquely answerable by the node
`DEFAULT_SUMMARY_EXTRACT_TEMPLATE`	SummaryExtractor	Summarizes key topics and entities of a section
`DEFAULT_EXTRACT_TEMPLATE_STR`	PydanticProgramExtractor	Formats section content for structured extraction

Helper Entities

add_class_name Function

def add_class_name(value: Any, handler: Callable, info: Any) -> Dict[str, Any]

A serialization helper that adds a class_name field to serialized output if the value has a class_name() method.
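The described behavior can be approximated as follows. This is a simplified sketch: add_class_name_sketch and FakeProgram are hypothetical names, the handler is called with the value only, and the choice to leave an existing class_name key untouched is an assumption.

```python
from typing import Any, Callable, Dict


def add_class_name_sketch(
    value: Any, handler: Callable[[Any], Dict[str, Any]], info: Any = None
) -> Dict[str, Any]:
    """Serialize `value` with `handler`, then tag the output with its class name."""
    output = handler(value)
    # Only add the tag when the object exposes class_name() and none is present.
    if "class_name" not in output and hasattr(value, "class_name"):
        output["class_name"] = value.class_name()
    return output


class FakeProgram:
    """Hypothetical object exposing a class_name() method."""

    def class_name(self) -> str:
        return "FakeProgram"
```

Recording the class name in the serialized output lets a loader decide which concrete class to reconstruct.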

DEFAULT_ENTITY_MAP

A dictionary mapping entity type codes (e.g., "PER", "ORG", "LOC") to human-readable category names. It covers 15 entity types, including persons, organizations, locations, animals, diseases, and events. The module pairs this map with tomaarsen/span-marker-mbert-base-multinerd as the default NER model.
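A map of this kind is typically used to bucket recognized entity spans under readable metadata keys. The sketch below uses a three-entry excerpt; the exact value strings and the group_entities helper are assumptions for illustration, and the NER model call is replaced by a hand-written list of (span, label) pairs.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Excerpt only; the value strings are assumed for this illustration.
ENTITY_MAP_EXCERPT: Dict[str, str] = {
    "PER": "persons",
    "ORG": "organizations",
    "LOC": "locations",
}


def group_entities(
    spans: Iterable[Tuple[str, str]], entity_map: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group (span, label) pairs by the category name mapped from each label."""
    grouped = defaultdict(set)
    for span, label in spans:
        category = entity_map.get(label)
        if category is not None:  # labels missing from the map are ignored
            grouped[category].add(span)
    # Sort each bucket so the output is deterministic.
    return {category: sorted(values) for category, values in grouped.items()}
```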

Usage Examples

Basic Usage

from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        TitleExtractor(nodes=5),
        KeywordExtractor(keywords=10),
        QuestionsAnsweredExtractor(questions=3),
        SummaryExtractor(summaries=["self", "prev", "next"]),
    ]
)

# `documents` is assumed to be a list of Document objects loaded beforehand.
nodes = pipeline.run(documents=documents)

PydanticProgramExtractor Usage

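from llama_index.core.extractors import PydanticProgramExtractor
from llama_index.core.program import LLMTextCompletionProgram
from pydantic import BaseModel

class EntityInfo(BaseModel):
    name: str
    description: str
    category: str

# Any BasePydanticProgram works here; LLMTextCompletionProgram is one option
# and falls back to Settings.llm when no llm is passed.
my_pydantic_program = LLMTextCompletionProgram.from_defaults(
    output_cls=EntityInfo,
    prompt_template_str="{input}",  # placeholder name must match input_key
)

extractor = PydanticProgramExtractor(
    program=my_pydantic_program,
    input_key="input",
)

# aextract is a coroutine: await it inside an async function,
# or call extractor.extract(nodes) from synchronous code.
metadata_list = await extractor.aextract(nodes)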

Related Pages

Environment:Run_llama_Llama_index_Python_LlamaIndex_Core
