Implementation:Run llama Llama index DocumentContextExtractor
| Knowledge Sources | |
|---|---|
| Domains | Metadata Extraction, RAG, Contextual Retrieval |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
An LLM-based context extractor that generates contextual metadata for document chunks by analyzing each chunk within the context of its parent document, implementing the Anthropic "Contextual Retrieval" approach to enhance RAG accuracy.
Description
DocumentContextExtractor extends BaseExtractor to enrich document nodes with contextual metadata. For each node, it retrieves the parent document from a docstore, sends both the full document and the individual chunk to an LLM, and stores the generated context as node metadata.
The extractor implements several key design decisions:
- Prompt strategies: Two built-in prompt constants are provided as class variables:
ORIGINAL_CONTEXT_PROMPT-- Generates a short succinct context to situate a chunk within the overall document (from the Anthropic cookbook)SUCCINCT_CONTEXT_PROMPT-- Generates keyword-laden descriptions for better search matching
- Rate limit handling: Exponential backoff retry logic with 5 retries starting at 60-second base delay, with jitter
- Oversized document strategies: Configurable handling via
OversizeStrategyliteral type ("warn","error", or"ignore") - Token counting: Uses
@lru_cache(maxsize=1000)for cached token counting to avoid redundant computation on repeated documents - Prompt caching: Sends the document text with
cache_control: ephemeralheaders to leverage Anthropic API prompt caching - Sorting optimization: Sorts nodes by source document ID before processing to maximize prompt cache hits, which can save significant API costs
- Skip logic: Nodes that already have the metadata key set are skipped to avoid reprocessing
Usage
Use this extractor when you want to improve retrieval accuracy for RAG pipelines where individual chunks may lack standalone semantic meaning. It is particularly effective for documents where chunk boundaries split related content across nodes. Requires an LLM with async chat support (the achat method) and a configured docstore containing the parent documents.
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File:
llama-index-core/llama_index/core/extractors/document_context.py - Lines: 1-351
Signature
class DocumentContextExtractor(BaseExtractor):
# Pydantic fields
llm: LLM
docstore: BaseDocumentStore
key: str
prompt: str
doc_ids: Set[str]
max_context_length: int
max_output_tokens: int
oversized_document_strategy: OversizeStrategy
num_workers: int = DEFAULT_NUM_WORKERS
def __init__(
self,
docstore: BaseDocumentStore,
llm: Optional[LLM] = None,
max_context_length: int = 1000,
key: str = DEFAULT_KEY,
prompt: str = ORIGINAL_CONTEXT_PROMPT,
num_workers: int = DEFAULT_NUM_WORKERS,
max_output_tokens: int = 512,
oversized_document_strategy: OversizeStrategy = "warn",
**kwargs: Any,
) -> None
Import
from llama_index.core.extractors.document_context import DocumentContextExtractor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| docstore | BaseDocumentStore | Yes | Storage for parent documents; used to retrieve full document text for each node |
| llm | Optional[LLM] | No | Language model instance with achat method; defaults to Settings.llm
|
| max_context_length | int | No | Maximum allowed document context length in tokens (default: 1000) |
| key | str | No | Metadata key for storing extracted context (default: "context")
|
| prompt | str | No | Prompt template for context generation (default: ORIGINAL_CONTEXT_PROMPT) |
| num_workers | int | No | Number of parallel workers for async processing (default: DEFAULT_NUM_WORKERS) |
| max_output_tokens | int | No | Maximum tokens in generated context (default: 512) |
| oversized_document_strategy | OversizeStrategy | No | Strategy for handling documents exceeding max_context_length: "warn", "error", or "ignore" (default: "warn")
|
Outputs
| Name | Type | Description |
|---|---|---|
| metadata_list | List[Dict] | List of metadata dictionaries, one per input node, containing the generated context under the configured key |
Key Methods
aextract
async def aextract(self, nodes: Sequence[BaseNode]) -> List[Dict]
Main entry point. Sorts nodes by source document ID for prompt cache optimization, retrieves parent documents, and dispatches parallel context generation jobs via run_jobs.
_agenerate_node_context
async def _agenerate_node_context(
self,
node: Union[Node, TextNode],
metadata: Dict,
document: Union[Node, TextNode],
prompt: str,
key: str,
) -> Dict
Generates context for a single node by sending the parent document and chunk to the LLM. Implements exponential backoff retry (5 retries, 60s base delay) for rate limit handling. Uses Anthropic prompt caching headers.
_get_document
async def _get_document(self, doc_id: str) -> Optional[Union[Node, TextNode]]
Retrieves a document from the docstore by ID. Validates that the document is a text node and applies the oversized document strategy if the token count exceeds max_context_length.
_count_tokens
@staticmethod
@lru_cache(maxsize=1000)
def _count_tokens(text: str) -> int
Cached token counting using Settings.tokenizer. The LRU cache avoids redundant tokenization on repeated documents.
Helper Functions
is_text_node
def is_text_node(node: BaseNode) -> TypeGuard[Union[Node, TextNode]]
Module-level type guard function that checks whether a node is an instance of Node or TextNode. Used throughout the extractor for type narrowing.
Constants
| Name | Description |
|---|---|
ORIGINAL_CONTEXT_PROMPT |
Prompts the LLM to generate a short succinct context situating a chunk within its document, designed for improving search retrieval |
SUCCINCT_CONTEXT_PROMPT |
Generates keyword-laden phrases describing main topics, entities, and actions; replaces pronouns with specific referents for better search matching |
OversizeStrategy |
Literal type alias for "warn", "error", or "ignore"
|
Usage Examples
Basic Usage
from llama_index.core.extractors.document_context import DocumentContextExtractor
from llama_index.core.storage.docstore import SimpleDocumentStore
docstore = SimpleDocumentStore()
# Add documents to docstore...
extractor = DocumentContextExtractor(
docstore=docstore,
llm=my_llm,
max_context_length=64000,
max_output_tokens=256,
)
# Extract context for nodes asynchronously
metadata_list = await extractor.aextract(nodes)
Using Succinct Prompt
extractor = DocumentContextExtractor(
docstore=docstore,
llm=my_llm,
prompt=DocumentContextExtractor.SUCCINCT_CONTEXT_PROMPT,
max_context_length=128000,
oversized_document_strategy="ignore",
)