Principle:Unstructured IO Unstructured Memory Profiling
| Knowledge Sources | |
|---|---|
| Domains | Performance, Profiling, Memory_Management |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A profiling technique that tracks memory allocation and deallocation during document partitioning to identify memory-bound bottlenecks and leaks.
Description
Memory profiling tracks every memory allocation made during pipeline execution, recording the allocation size, call stack, and lifetime. This reveals which functions allocate the most memory, where memory peaks occur, and whether memory is being released properly.
The Unstructured profiling suite uses memray, a Python memory profiler that instruments the allocator at the C level for accurate, low-overhead tracking. Memray produces binary profiles that can be visualized as memory flamegraphs, allocation tables, call trees, and summary statistics.
Usage
Use this principle when partition operations consume unexpectedly large amounts of memory, cause out-of-memory errors, or exhibit memory growth over time. Memory profiling is essential for processing large documents (e.g., 1000+ page PDFs) where memory usage can become the limiting factor.
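As a rough illustration of the usage described above, the sketch below wraps a single call in Memray's `Tracker` context manager so that its allocations are written to a capture file. It assumes Memray is installed (`pip install memray`). The names `profile_call` and `hypothetical_partition` are invented for this example and are not part of Unstructured or Memray; substitute a real partition call for the stand-in workload.

```python
from pathlib import Path

try:
    import memray  # third-party profiler; assumed to be installed
except ImportError:
    memray = None  # fall back to running the workload unprofiled


def profile_call(func, capture_path):
    """Run func() under memray.Tracker when Memray is available.

    Returns (result, capture_path), or (result, None) if Memray is missing.
    """
    if memray is None:
        return func(), None
    capture_path = Path(capture_path)
    # Use a fresh path: Memray is expected to refuse to overwrite
    # an existing capture file.
    with memray.Tracker(str(capture_path)):
        result = func()
    return result, capture_path


def hypothetical_partition():
    # Hypothetical stand-in for a real partition call: allocates ~2 MB.
    return [bytes(100_000) for _ in range(20)]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        elements, capture = profile_call(
            hypothetical_partition, Path(tmp) / "partition.bin"
        )
        print(len(elements), "elements; capture file:", capture)
```

Keeping the tracked region as narrow as the partition call itself helps keep the capture file small and the resulting reports focused.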
Theoretical Basis
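Allocator instrumentation: Memray intercepts allocator calls such as malloc, calloc, realloc, and free, and records each one together with the Python call stack that triggered it. By default, small objects served from Python's pooled allocator (pymalloc) appear only as the larger arena allocations behind them; Memray's --trace-python-allocators option records them individually, at the cost of more overhead and larger capture files.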
Key metrics:
- Peak memory: Maximum memory usage during execution
- Total allocations: Count and cumulative size of all allocations
- Allocation hotspots: Functions responsible for the most allocations
- Memory timeline: How memory usage changes over time
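To make these metrics concrete, here is a minimal sketch that uses the standard library's `tracemalloc` module as a stand-in. It is not Memray and sees only allocations made through Python's allocators, but it reports the same kinds of numbers: peak memory and the source lines responsible for the most live bytes. The `measure` and `fake_partition` names are hypothetical.

```python
import tracemalloc


def measure(func, top_n=3):
    """Run func() under tracemalloc.

    Returns (result, peak_bytes, hotspots), where hotspots is a list of
    (location, live_bytes, allocation_count) sorted by live bytes.
    """
    tracemalloc.start()
    try:
        result = func()
        # Read the peak before taking a snapshot, since the snapshot
        # itself allocates memory.
        _current, peak = tracemalloc.get_traced_memory()
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    stats = snapshot.statistics("lineno")[:top_n]
    hotspots = [(str(s.traceback[0]), s.size, s.count) for s in stats]
    return result, peak, hotspots


def fake_partition():
    # Hypothetical workload: holds five 1 MB buffers, like parsed pages.
    return [bytearray(1_000_000) for _ in range(5)]


if __name__ == "__main__":
    _, peak, hotspots = measure(fake_partition)
    print(f"peak: {peak / 1e6:.1f} MB")
    for location, size, count in hotspots:
        print(f"{location}: {size / 1e6:.1f} MB in {count} allocations")
```

Sampling `tracemalloc.get_traced_memory()` at intervals while the workload runs would give a crude memory timeline; Memray records this kind of timeline as part of its capture.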
Visualizations:
- Flamegraph: Shows allocation call stacks, with each frame's width proportional to the bytes it allocated (by default, the memory held at the point of peak usage)
- Table: Lists top allocators sorted by total bytes
- Tree: Hierarchical view of allocation call chains
- Summary/Stats: Aggregate memory metrics
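Each view above corresponds to a Memray subcommand that reads a capture file. The sketch below only builds the command lines and prints them, so it runs without Memray installed. The helper name `report_commands` and the file name `partition.bin` are assumptions made for illustration.

```python
# Memray reporter subcommands, in the order of the views listed above.
REPORTERS = ("flamegraph", "table", "tree", "summary", "stats")


def report_commands(capture_file):
    """Map each reporter name to the argv list that would render it."""
    return {name: ["memray", name, str(capture_file)] for name in REPORTERS}


if __name__ == "__main__":
    # Print the commands rather than running them; some reporters are
    # interactive terminal views and are best launched by hand.
    for name, argv in report_commands("partition.bin").items():
        print(f"{name:>10}: {' '.join(argv)}")
```

The flamegraph and table reporters write HTML files, while tree, summary, and stats render in the terminal, so the latter three suit a quick look over SSH.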