Overview
Implements five concrete LLM-based metadata extractors -- TitleExtractor, KeywordExtractor, QuestionsAnsweredExtractor, SummaryExtractor, and PydanticProgramExtractor -- each enriching document nodes with different types of metadata to improve RAG retrieval quality.
Description
This module provides the primary collection of metadata enrichment tools in LlamaIndex. All five extractors extend BaseExtractor and use LLM-based generation with async parallel processing via run_jobs. Each extractor targets a specific type of metadata:
- TitleExtractor -- Extracts document-level titles by gathering title candidates from the first N nodes of each document, then combining them into a single document_title field
- KeywordExtractor -- Generates excerpt_keywords for each individual node using a configurable keyword count
- QuestionsAnsweredExtractor -- Produces questions_this_excerpt_can_answer metadata, generating questions that are uniquely answerable by each chunk
- SummaryExtractor -- Creates section_summary and optionally prev_section_summary/next_section_summary for adjacent node context sharing
- PydanticProgramExtractor -- Uses a BasePydanticProgram to extract structured data conforming to a user-defined Pydantic model
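The four LLM-backed extractors (TitleExtractor, KeywordExtractor, QuestionsAnsweredExtractor, SummaryExtractor) each accept a customizable prompt template and an LLM instance, with sensible defaults for both. They also still accept the deprecated llm_predictor parameter for backward compatibility. PydanticProgramExtractor is configured through its program field instead.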
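The bounded parallelism described above can be pictured with a stdlib-only sketch. This is not the library's run_jobs. The names run_jobs_sketch and fake_llm_call are hypothetical, and the semaphore is only an assumption about how a num_workers-style limit might be enforced.

```python
import asyncio
from typing import Awaitable, List


async def run_jobs_sketch(jobs: List[Awaitable[str]], workers: int = 4) -> List[str]:
    """Await all jobs with at most `workers` running at once, keeping input order."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(job: Awaitable[str]) -> str:
        # Each job waits for a free slot before it is awaited.
        async with semaphore:
            return await job

    return await asyncio.gather(*(bounded(job) for job in jobs))


async def fake_llm_call(text: str) -> str:
    """Hypothetical stand-in for an LLM request."""
    await asyncio.sleep(0.01)
    return text.upper()


results = asyncio.run(
    run_jobs_sketch([fake_llm_call(t) for t in ["alpha", "beta", "gamma"]], workers=2)
)
```

Results come back in input order, which matters because each result is later matched to a node by position.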
Usage
Use these extractors in an ingestion pipeline to enrich nodes with metadata before indexing. They improve retrieval accuracy by adding semantic signals (titles, keywords, summaries, questions) that help vector stores and keyword-based retrievers find relevant chunks. Choose the appropriate extractor based on the type of metadata signal most valuable for your use case.
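As a rough illustration of how several extractors can enrich the same nodes, the sketch below merges per-node metadata dicts by position. It assumes each extractor returns exactly one dict per node. merge_metadata is a hypothetical helper rather than a library function, and the sample values are made up.

```python
from typing import Any, Dict, List


def merge_metadata(
    node_metadata: List[Dict[str, Any]],
    extracted: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Merge one extractor's output into per-node metadata, matched by position."""
    if len(node_metadata) != len(extracted):
        raise ValueError("extractor must return exactly one dict per node")
    return [{**existing, **new} for existing, new in zip(node_metadata, extracted)]


# Made-up starting metadata and extractor outputs for two nodes of one document.
nodes_meta = [{"file_name": "a.txt"}, {"file_name": "a.txt"}]
title_out = [{"document_title": "Intro to RAG"}] * 2
keyword_out = [{"excerpt_keywords": "rag, retrieval"}, {"excerpt_keywords": "index, nodes"}]

for output in (title_out, keyword_out):
    nodes_meta = merge_metadata(nodes_meta, output)
```

Because each extractor writes under its own key, the outputs accumulate without overwriting one another.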
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File: llama-index-core/llama_index/core/extractors/metadata_extractors.py
- Lines: 1-535
Import
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
    PydanticProgramExtractor,
)
Signature
class TitleExtractor(BaseExtractor):
    is_text_node_only: bool = False
    llm: SerializeAsAny[LLM]
    nodes: int = Field(default=5, gt=0)
    node_template: str = Field(default=DEFAULT_TITLE_NODE_TEMPLATE)
    combine_template: str = Field(default=DEFAULT_TITLE_COMBINE_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        nodes: int = 5,
        node_template: str = DEFAULT_TITLE_NODE_TEMPLATE,
        combine_template: str = DEFAULT_TITLE_COMBINE_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None
Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm |
| nodes | int | No | Number of nodes from the front of each document to use for title extraction (default: 5) |
| node_template | str | No | Prompt template for extracting per-node title clues |
| combine_template | str | No | Prompt template for combining node-level clues into a document title |
| num_workers | int | No | Number of parallel workers (default: DEFAULT_NUM_WORKERS) |
Output
| Name | Type | Description |
| --- | --- | --- |
| document_title | str | Inferred document title stored in each node's metadata |
Key Methods
- aextract(nodes) -- Groups nodes by ref_doc_id, extracts titles per document, returns metadata dicts
- separate_nodes_by_ref_id(nodes) -- Groups nodes by document reference, taking at most N nodes per document
- extract_titles(nodes_by_doc_id) -- Generates title candidates and combines them using the LLM
- get_title_candidates(nodes) -- Extracts individual title clues from each node
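The flow these methods describe can be sketched with plain Python. This is a simplified stand-in, not the library's code. Nodes are reduced to (ref_doc_id, text) tuples, the prompts are invented, and echo_llm is a hypothetical deterministic function used in place of a real LLM so the example runs offline.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Node = Tuple[str, str]  # (ref_doc_id, text): simplified stand-in for a document node


def separate_by_ref_id(nodes: List[Node], limit: int) -> Dict[str, List[str]]:
    """Group node texts by document, keeping at most `limit` from the front of each."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ref_doc_id, text in nodes:
        if len(grouped[ref_doc_id]) < limit:
            grouped[ref_doc_id].append(text)
    return dict(grouped)


def extract_titles(
    nodes: List[Node], limit: int, llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Return one {'document_title': ...} dict per input node."""
    titles: Dict[str, str] = {}
    for doc_id, texts in separate_by_ref_id(nodes, limit).items():
        # One candidate per sampled node, then a single combining call per document.
        candidates = [llm(f"Suggest a title for: {t}") for t in texts]
        titles[doc_id] = llm("Combine into one title: " + ", ".join(candidates))
    return [{"document_title": titles[doc_id]} for doc_id, _ in nodes]


def echo_llm(prompt: str) -> str:
    """Hypothetical 'LLM' that echoes whatever follows the first ': '."""
    return prompt.split(": ", 1)[1]


nodes = [("doc1", "a"), ("doc1", "b"), ("doc1", "c"), ("doc2", "x")]
metadata = extract_titles(nodes, limit=2, llm=echo_llm)
```

Every node of a document receives the same title, including nodes beyond the first N that were not sampled for candidates.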
Signature
class KeywordExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    keywords: int = Field(default=5, gt=0)
    prompt_template: str = Field(default=DEFAULT_KEYWORD_EXTRACT_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        keywords: int = 5,
        prompt_template: str = DEFAULT_KEYWORD_EXTRACT_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None
Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm |
| keywords | int | No | Number of keywords to extract per node (default: 5) |
| prompt_template | str | No | Prompt template for keyword extraction |
| num_workers | int | No | Number of parallel workers |
Output
| Name | Type | Description |
| --- | --- | --- |
| excerpt_keywords | str | Comma-separated keywords stored in node metadata |
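Since excerpt_keywords is a single comma-separated string produced by an LLM, downstream code that wants a list may need to clean it up first. The helper below is a hypothetical sketch, not part of the library, and the sample string is made up.

```python
from typing import Dict, List


def parse_keywords(metadata: Dict[str, str]) -> List[str]:
    """Split the comma-separated excerpt_keywords string into clean, unique terms."""
    raw = metadata.get("excerpt_keywords", "")
    seen = set()
    keywords: List[str] = []
    for term in raw.split(","):
        cleaned = term.strip().lower()
        # Skip empty fragments (e.g. from a trailing comma) and case-insensitive repeats.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            keywords.append(cleaned)
    return keywords


example = {"excerpt_keywords": "RAG, retrieval , Indexing,rag,"}
keywords = parse_keywords(example)
```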
Signature
class QuestionsAnsweredExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    questions: int = Field(default=5, gt=0)
    prompt_template: str = Field(default=DEFAULT_QUESTION_GEN_TMPL)
    embedding_only: bool = Field(default=True)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        questions: int = 5,
        prompt_template: str = DEFAULT_QUESTION_GEN_TMPL,
        embedding_only: bool = True,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None
Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm |
| questions | int | No | Number of questions to generate per node (default: 5) |
| prompt_template | str | No | Prompt template for question generation |
| embedding_only | bool | No | Whether to use metadata for embeddings only (default: True) |
| num_workers | int | No | Number of parallel workers |
Output
| Name | Type | Description |
| --- | --- | --- |
| questions_this_excerpt_can_answer | str | Generated questions stored in node metadata |
Signature
class SummaryExtractor(BaseExtractor):
    llm: SerializeAsAny[LLM]
    summaries: List[str]
    prompt_template: str = Field(default=DEFAULT_SUMMARY_EXTRACT_TEMPLATE)

    def __init__(
        self,
        llm: Optional[LLM] = None,
        llm_predictor: Optional[LLM] = None,
        summaries: List[str] = ["self"],
        prompt_template: str = DEFAULT_SUMMARY_EXTRACT_TEMPLATE,
        num_workers: int = DEFAULT_NUM_WORKERS,
        **kwargs: Any,
    ) -> None
Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| llm | Optional[LLM] | No | Language model for generation; defaults to Settings.llm |
| summaries | List[str] | No | List of summary types: "self", "prev", "next" (default: ["self"]) |
| prompt_template | str | No | Prompt template for summary extraction |
| num_workers | int | No | Number of parallel workers |
Output
| Name | Type | Description |
| --- | --- | --- |
| section_summary | str | Summary of the current node (when "self" in summaries) |
| prev_section_summary | str | Summary of the previous node (when "prev" in summaries) |
| next_section_summary | str | Summary of the next node (when "next" in summaries) |
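The way summaries are shared between adjacent nodes can be sketched as follows. This is a simplified illustration built from the output table above, not the library's implementation. share_summaries is a hypothetical name, and the per-node summaries are assumed to have been generated already.

```python
from typing import Dict, List, Sequence


def share_summaries(
    node_summaries: Sequence[str], summaries: Sequence[str] = ("self",)
) -> List[Dict[str, str]]:
    """Build per-node metadata from per-node summaries and the requested types."""
    allowed = {"self", "prev", "next"}
    if not set(summaries) <= allowed:
        raise ValueError(f"summaries must be a subset of {sorted(allowed)}")

    metadata_list: List[Dict[str, str]] = []
    last = len(node_summaries) - 1
    for i, summary in enumerate(node_summaries):
        metadata: Dict[str, str] = {}
        # The first node has no predecessor and the last node has no successor.
        if "prev" in summaries and i > 0:
            metadata["prev_section_summary"] = node_summaries[i - 1]
        if "next" in summaries and i < last:
            metadata["next_section_summary"] = node_summaries[i + 1]
        if "self" in summaries:
            metadata["section_summary"] = summary
        metadata_list.append(metadata)
    return metadata_list


result = share_summaries(["intro", "method", "results"], ["self", "prev", "next"])
```

Note that the node order passed in determines what counts as "previous" and "next", so nodes should be in document order.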
Signature
class PydanticProgramExtractor(BaseExtractor, Generic[Model]):
    program: SerializeAsAny[BasePydanticProgram[Model]]
    input_key: str = Field(default="input")
    extract_template_str: str = Field(default=DEFAULT_EXTRACT_TEMPLATE_STR)
Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| program | BasePydanticProgram[Model] | Yes | Pydantic program that defines the extraction schema and LLM interaction |
| input_key | str | No | Key used as input to the program template (default: "input") |
| extract_template_str | str | No | Template string for extraction context formatting |
Output
| Name | Type | Description |
| --- | --- | --- |
| (model fields) | Dict[str, Any] | Dictionary of all fields from the extracted Pydantic model via model_dump() |
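A stdlib-only sketch of the flow implied by these fields: format the node text into a template, hand it to the program under input_key, and dump the returned object to a dict. Everything here is a stand-in. EntityInfo is a dataclass imitating a Pydantic model, the template text is invented, and fake_program is a hypothetical callable where a real program would call an LLM.

```python
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict


@dataclass
class EntityInfo:
    """Stand-in for a Pydantic model; model_dump() mirrors Pydantic v2's method name."""

    name: str
    category: str

    def model_dump(self) -> Dict[str, Any]:
        return asdict(self)


def extract_metadata(
    text: str,
    program: Callable[..., EntityInfo],
    input_key: str = "input",
    # Invented template; the real default lives in DEFAULT_EXTRACT_TEMPLATE_STR.
    template: str = "Section content:\n{context_str}\nExtract a {class_name} object.",
) -> Dict[str, Any]:
    """Format the node text, pass it to the program under input_key, dump the result."""
    extract_str = template.format(context_str=text, class_name="EntityInfo")
    return program(**{input_key: extract_str}).model_dump()


def fake_program(**kwargs: str) -> EntityInfo:
    """Hypothetical program that inspects the prompt instead of calling an LLM."""
    prompt = kwargs["input"]
    name = "LlamaIndex" if "LlamaIndex" in prompt else "unknown"
    return EntityInfo(name=name, category="library")


metadata = extract_metadata("LlamaIndex is a data framework.", fake_program)
```

The resulting dict's keys are the model's field names, so they become metadata keys on the node.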
Default Prompt Templates
| Constant | Used By | Purpose |
| --- | --- | --- |
| DEFAULT_TITLE_NODE_TEMPLATE | TitleExtractor | Extracts title clues from individual nodes |
| DEFAULT_TITLE_COMBINE_TEMPLATE | TitleExtractor | Combines candidate titles into a final document title |
| DEFAULT_KEYWORD_EXTRACT_TEMPLATE | KeywordExtractor | Generates comma-separated keywords from node content |
| DEFAULT_QUESTION_GEN_TMPL | QuestionsAnsweredExtractor | Generates questions uniquely answerable by the node |
| DEFAULT_SUMMARY_EXTRACT_TEMPLATE | SummaryExtractor | Summarizes key topics and entities of a section |
| DEFAULT_EXTRACT_TEMPLATE_STR | PydanticProgramExtractor | Formats section content for structured extraction |
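When replacing one of these defaults, it can help to confirm that a custom template still contains the placeholders the extractor fills in. The sketch below uses only the standard library. The placeholder names context_str and num_questions are assumptions about the question template and should be checked against the defaults in the installed version. Both helper functions are hypothetical.

```python
from string import Formatter
from typing import Set


def template_fields(template: str) -> Set[str]:
    """Return the named placeholders found in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_template(template: str, required: Set[str]) -> None:
    """Raise if the template is missing any placeholder the extractor would fill."""
    missing = required - template_fields(template)
    if missing:
        raise ValueError(f"template is missing placeholders: {sorted(missing)}")


# Assumed placeholder names; verify them against the library's default template.
custom_question_template = (
    "Context:\n{context_str}\n\n"
    "Write {num_questions} questions that only this context can answer."
)
check_template(custom_question_template, {"context_str", "num_questions"})
fields = template_fields(custom_question_template)
```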
Helper Entities
add_class_name Function
def add_class_name(value: Any, handler: Callable, info: Any) -> Dict[str, Any]
A serialization helper that adds a class_name field to serialized output if the value has a class_name() method.
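Based on the signature and description above, the helper's behavior might look like the sketch below. It is a reconstruction, not the verbatim source. FakeLLM and plain_handler are hypothetical stand-ins for a serializable object and the wrapped serializer.

```python
from typing import Any, Callable, Dict


def add_class_name_sketch(value: Any, handler: Callable, info: Any) -> Dict[str, Any]:
    """Serialize with the wrapped handler, then tag the output with class_name."""
    output = handler(value, info)
    if hasattr(value, "class_name"):
        output["class_name"] = value.class_name()
    return output


class FakeLLM:
    """Hypothetical object exposing class_name(), as the description requires."""

    model = "demo-model"

    @classmethod
    def class_name(cls) -> str:
        return "FakeLLM"


def plain_handler(value: Any, info: Any) -> Dict[str, Any]:
    """Hypothetical default serializer that knows nothing about class names."""
    return {"model": value.model}


serialized = add_class_name_sketch(FakeLLM(), plain_handler, info=None)
```

Recording the class name in the serialized output is what lets a loader later decide which concrete class to rebuild.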
DEFAULT_ENTITY_MAP
A dictionary mapping entity type codes (e.g., "PER", "ORG", "LOC") to human-readable category names. It covers 15 entity types, including persons, organizations, locations, animals, diseases, and events. The module pairs it with tomaarsen/span-marker-mbert-base-multinerd as the default NER model.
Usage Examples
Basic Usage
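from llama_index.core import SimpleDirectoryReader
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Assumption: source files live in ./data; substitute your own loader or path.
documents = SimpleDirectoryReader("./data").load_data()

# Split first, then let each extractor add its own metadata keys to the nodes.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        TitleExtractor(nodes=5),
        KeywordExtractor(keywords=10),
        QuestionsAnsweredExtractor(questions=3),
        SummaryExtractor(summaries=["self", "prev", "next"]),
    ]
)
nodes = pipeline.run(documents=documents)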
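PydanticProgramExtractor Usage
import asyncio

from pydantic import BaseModel

from llama_index.core.extractors import PydanticProgramExtractor
from llama_index.core.program import LLMTextCompletionProgram


class EntityInfo(BaseModel):
    name: str
    description: str
    category: str


# Assumed setup: any BasePydanticProgram works; the prompt must expose {input}.
my_pydantic_program = LLMTextCompletionProgram.from_defaults(
    output_cls=EntityInfo,
    prompt_template_str="Extract the main entity from this text:\n{input}",
)
extractor = PydanticProgramExtractor(
    program=my_pydantic_program,  # BasePydanticProgram instance
    input_key="input",
)

# aextract is a coroutine; in a notebook, await it directly instead of using asyncio.run.
metadata_list = asyncio.run(extractor.aextract(nodes))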
Related Pages