Principle:Unstructured IO Unstructured Element Model
| Knowledge Sources | |
|---|---|
| Domains | Document_Processing, Data_Modeling |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A typed data model that represents individual units of content extracted from documents, providing a uniform interface across all document formats.
Description
The Element model is the core data abstraction in the Unstructured library. Every piece of content extracted from a document (a paragraph, title, table, image caption, list item, and so on) is represented as a typed Element object. The type hierarchy encodes semantic meaning: a Title element is semantically distinct from a NarrativeText element, even though both contain text.
Each element carries:
- A unique identifier (element_id) generated from content hashing
- Rich metadata (page number, language, source information, links)
- Optional spatial coordinates, stored in the metadata, for layout-aware processing
- Detection origin tracking for debugging extraction pipelines
This principle solves the representation problem: how to model heterogeneous document content in a uniform, type-safe way that preserves semantic structure without enforcing a rigid hierarchy.
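The attributes listed above can be pictured with a small stand-in. This is a minimal sketch using only the standard library, not the library's own classes: `SketchElement`, `SketchMetadata`, and the `"pdf_text_layer"` origin label are hypothetical names, and hashing only the text is an assumption about how the content hash is formed.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for the metadata object; every field is optional."""

    page_number: Optional[int] = None
    filename: Optional[str] = None
    languages: Optional[List[str]] = None
    # Bounding-box corner points; None when the source format has no layout.
    coordinates: Optional[Tuple[Tuple[float, float], ...]] = None
    # Which extraction path produced the element, kept for debugging.
    detection_origin: Optional[str] = None


@dataclass
class SketchElement:
    """Hypothetical stand-in for a text-bearing element."""

    text: str
    metadata: SketchMetadata = field(default_factory=SketchMetadata)

    @property
    def element_id(self) -> str:
        # Assumption: the ID is a SHA-256 digest of the text alone.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


first = SketchElement(
    "Quarterly results",
    SketchMetadata(page_number=1, detection_origin="pdf_text_layer"),
)
second = SketchElement("Quarterly results")

# Same content gives the same ID, regardless of metadata.
same_id = first.element_id == second.element_id
```

In this sketch the ID depends only on the text, so two elements with identical text collide even when they sit on different pages. Whether the real ID also mixes in position or source information is not stated here.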
Usage
Use this principle whenever you need to understand, create, or manipulate document elements. The Element model is the lingua franca of the Unstructured ecosystem: all partitioners produce elements, all chunkers consume elements, all serializers convert elements, and all embedding providers annotate elements.
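The producer and consumer roles described above can be illustrated with a toy pipeline. This is a hedged sketch built from standard-library pieces only: `ToyElement`, `toy_partition`, `toy_chunk`, and `toy_serialize` are hypothetical names, and the title heuristic and chunking rule are invented for illustration rather than taken from the library.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ToyElement:
    """Hypothetical element: a text payload plus a type label."""

    text: str
    category: str


def toy_partition(raw: str) -> List[ToyElement]:
    """Produce elements: short lines without a trailing period become titles."""
    elements: List[ToyElement] = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        is_title = len(line.split()) <= 5 and not line.endswith(".")
        elements.append(ToyElement(line, "Title" if is_title else "NarrativeText"))
    return elements


def toy_chunk(elements: List[ToyElement], max_chars: int = 80) -> List[ToyElement]:
    """Consume elements: merge neighbours, starting a new chunk at each title
    or when adding the next element would exceed max_chars."""
    chunks: List[ToyElement] = []
    buffer: List[str] = []
    for element in elements:
        joined_len = len(" ".join(buffer + [element.text]))
        if buffer and (element.category == "Title" or joined_len > max_chars):
            chunks.append(ToyElement(" ".join(buffer), "CompositeElement"))
            buffer = []
        buffer.append(element.text)
    if buffer:
        chunks.append(ToyElement(" ".join(buffer), "CompositeElement"))
    return chunks


def toy_serialize(elements: List[ToyElement]) -> str:
    """Convert elements to a JSON array of plain dictionaries."""
    return json.dumps([asdict(element) for element in elements])


raw_text = "Introduction\nThis report covers sales.\nMethods\nWe surveyed ten stores."
elements = toy_partition(raw_text)
chunks = toy_chunk(elements)
payload = toy_serialize(chunks)
```

Each stage only needs to agree on the element shape, which is the point of a shared model: the chunker never has to know which document format the partitioner read.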
Theoretical Basis
The Element model uses an inheritance hierarchy rooted in an abstract Element base class:
# Abstract type hierarchy (not runnable code)
Element (abstract base)
├── Text (base for text-bearing elements)
│ ├── NarrativeText # Body paragraphs
│ ├── Title # Section headings
│ ├── ListItem # Bullet/numbered items
│ ├── Header # Page headers
│ ├── Footer # Page footers
│ ├── FigureCaption # Image captions
│ ├── Address # Physical addresses
│ ├── EmailAddress # Email addresses
│ ├── Formula # Mathematical formulas
│ ├── CompositeElement # Merged chunks
│ ├── Table # Tabular data
│ ├── Image # Image regions
│ └── PageBreak # Page boundaries
└── CheckBox # Form checkboxes
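A few branches of the hierarchy above can be sketched as Python classes to show how the types support dispatch. This is a simplified illustration, not the library's source: the constructors are reduced to a single argument, and `describe` is a hypothetical helper.

```python
from abc import ABC


class Element(ABC):
    """Sketch of the abstract root of the hierarchy."""


class Text(Element):
    """Base for elements that carry a text payload."""

    def __init__(self, text: str) -> None:
        self.text = text


class Title(Text):
    """Section heading."""


class NarrativeText(Text):
    """Body paragraph."""


class CheckBox(Element):
    """Form checkbox; carries a checked state rather than text."""

    def __init__(self, checked: bool = False) -> None:
        self.checked = checked


def describe(element: Element) -> str:
    """Hypothetical helper: branch on the most specific type first."""
    if isinstance(element, Title):
        return f"heading: {element.text}"
    if isinstance(element, Text):
        return f"text: {element.text}"
    return f"non-text element: {type(element).__name__}"


labels = [
    describe(Title("Results")),
    describe(NarrativeText("Sales rose.")),
    describe(CheckBox(checked=True)),
]
```

Because every text-bearing type shares the `Text` base, generic code can handle unknown subclasses through the `isinstance(element, Text)` branch and still treat headings specially.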
Key design decisions:
- Flat sequence: Elements form an ordered list, not a tree. Document structure is encoded through element types and metadata (parent_id, category_depth) rather than nesting.
- Content-based IDs: Element IDs are derived from a SHA-256 hash of element content, so re-processing the same document yields the same IDs.
- Rich metadata: The ElementMetadata object carries 40+ optional fields covering source information, spatial coordinates, language detection, and format-specific data.
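The flat-sequence decision means any tree view has to be reconstructed from metadata. The following is a minimal sketch of that reconstruction using only `parent_id`; `FlatElement`, `children_by_parent`, and `render_outline` are hypothetical names, the short IDs are placeholders rather than real hashes, and `category_depth` is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FlatElement:
    """Hypothetical record: one entry in the ordered element list."""

    element_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a top-level element


def children_by_parent(
    elements: List[FlatElement],
) -> Dict[Optional[str], List[FlatElement]]:
    """Group elements by parent_id, keeping document order within each group."""
    groups: Dict[Optional[str], List[FlatElement]] = {}
    for element in elements:
        groups.setdefault(element.parent_id, []).append(element)
    return groups


def render_outline(elements: List[FlatElement]) -> List[str]:
    """Rebuild an indented outline from the flat list."""
    groups = children_by_parent(elements)
    lines: List[str] = []

    def visit(parent_id: Optional[str], level: int) -> None:
        for child in groups.get(parent_id, []):
            lines.append("  " * level + child.text)
            visit(child.element_id, level + 1)

    visit(None, 0)
    return lines


flat = [
    FlatElement("a1", "1. Overview"),
    FlatElement("b2", "Elements are typed.", parent_id="a1"),
    FlatElement("c3", "1.1 Identifiers", parent_id="a1"),
    FlatElement("d4", "IDs are content hashes.", parent_id="c3"),
]
outline = render_outline(flat)
```

Keeping the list flat lets consumers that ignore structure, such as a simple text exporter, iterate in document order without walking a tree, while consumers that need nesting can rebuild it on demand.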