

Principle:Unstructured IO Unstructured Element Model

From Leeroopedia
Knowledge Sources
Domains Document_Processing, Data_Modeling
Last Updated 2026-02-12 00:00 GMT

Overview

A typed data model that represents individual units of content extracted from documents, providing a uniform interface across all document formats.

Description

The Element model is the core data abstraction in the Unstructured library. Every piece of content extracted from a document (a paragraph, title, table, image caption, list item, etc.) is represented as a typed Element object. The type hierarchy encodes semantic meaning: a Title element is structurally different from a NarrativeText element, even though both contain text.

Each element carries:

  • A unique identifier (element_id) generated from content hashing
  • Rich metadata (page number, coordinates, language, source information, links)
  • Optional spatial coordinates for layout-aware processing
  • Detection origin tracking for debugging extraction pipelines

This principle solves the representation problem: how to model heterogeneous document content in a uniform, type-safe way that preserves semantic structure without enforcing a rigid hierarchy.
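
The fields listed above can be pictured with a minimal sketch. The class names SketchElement and SketchMetadata, the sample values, and the choice to hash only the text are all assumptions made for illustration; this is not the library's actual implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for the library's metadata object."""
    page_number: Optional[int] = None
    filename: Optional[str] = None
    languages: Tuple[str, ...] = ()
    # Bounding-box corner points; optional because not every format has layout.
    coordinates: Optional[Tuple[Tuple[float, float], ...]] = None
    # Which extraction path produced the element (helps when debugging).
    detection_origin: Optional[str] = None


@dataclass
class SketchElement:
    """Hypothetical element: text, metadata, and a content-derived ID."""
    text: str
    metadata: SketchMetadata = field(default_factory=SketchMetadata)

    @property
    def element_id(self) -> str:
        # Assumption: only the text is hashed; the real library may use more inputs.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


para = SketchElement(
    text="Revenue grew 12% year over year.",
    metadata=SketchMetadata(
        page_number=3,
        filename="report.pdf",          # illustrative value
        languages=("eng",),
        detection_origin="pdfminer",    # illustrative value
    ),
)
same_text = SketchElement(text="Revenue grew 12% year over year.")
```

In this sketch, para and same_text share an element_id even though their metadata differs, because the ID depends on the text alone.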

Usage

Use this principle whenever you need to understand, create, or manipulate document elements. The Element model is the lingua franca of the Unstructured ecosystem: all partitioners produce elements, all chunkers consume elements, all serializers convert elements, and all embedding providers annotate elements.
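
The idea that every stage exchanges the same list of elements can be sketched with toy stand-ins. The functions toy_partition, toy_chunk_by_title, and toy_to_dicts are hypothetical helpers written for this example and do not reproduce the library's own partitioners, chunkers, or serializers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Element:
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def toy_partition(lines: List[str]) -> List[Element]:
    """Toy partitioner: short lines without a trailing period count as titles."""
    elements = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        is_title = len(line) < 40 and not line.endswith(".")
        elements.append(Element("Title" if is_title else "NarrativeText", line))
    return elements


def toy_chunk_by_title(elements: List[Element]) -> List[Element]:
    """Toy chunker: start a new chunk at every Title and merge the text."""
    groups: List[List[Element]] = []
    for el in elements:
        if el.category == "Title" or not groups:
            groups.append([])
        groups[-1].append(el)
    return [
        Element("CompositeElement", "\n".join(e.text for e in group))
        for group in groups
    ]


def toy_to_dicts(elements: List[Element]) -> List[Dict[str, str]]:
    """Toy serializer: elements in, plain dictionaries out."""
    return [{"type": el.category, "text": el.text} for el in elements]


lines = [
    "Introduction",
    "Elements are typed.",
    "Usage",
    "Every stage consumes elements.",
]
chunks = toy_chunk_by_title(toy_partition(lines))
records = toy_to_dicts(chunks)
```

Each function accepts or returns a list of Element objects, so stages can be swapped or reordered without any format-specific glue.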

Theoretical Basis

The Element model uses an inheritance hierarchy rooted in an abstract Element base class:

# Abstract type hierarchy (not runnable code)
Element (abstract base)
├── Text (base for text-bearing elements)
│   ├── NarrativeText     # Body paragraphs
│   ├── Title             # Section headings
│   ├── ListItem          # Bullet/numbered items
│   ├── Header            # Page headers
│   ├── Footer            # Page footers
│   ├── FigureCaption     # Image captions
│   ├── Address           # Physical addresses
│   ├── EmailAddress      # Email addresses
│   ├── Formula           # Mathematical formulas
│   └── CompositeElement  # Merged chunks
├── Table                 # Tabular data
├── Image                 # Image regions
├── PageBreak             # Page boundaries
└── CheckBox              # Form checkboxes
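
A small subset of this hierarchy can be sketched in Python to show how the type alone carries semantic meaning. The classes below are simplified stand-ins defined for this example (constructors and the category attribute are assumptions), not the library's own definitions.

```python
from abc import ABC
from typing import List


class Element(ABC):
    """Root of the sketch hierarchy; every element reports a category."""
    category = "Element"


class Text(Element):
    """Base for elements that carry text."""
    category = "Text"

    def __init__(self, text: str) -> None:
        self.text = text


class NarrativeText(Text):
    category = "NarrativeText"


class Title(Text):
    category = "Title"


class ListItem(Text):
    category = "ListItem"


class CheckBox(Element):
    category = "CheckBox"

    def __init__(self, checked: bool = False) -> None:
        self.checked = checked


def headings(elements: List[Element]) -> List[str]:
    """Pick out section headings by type rather than by inspecting the text."""
    return [el.text for el in elements if isinstance(el, Title)]


doc = [
    Title("Setup"),
    NarrativeText("Install the package."),
    ListItem("Run the tests"),
    Title("Usage"),
    CheckBox(checked=True),
]
titles = headings(doc)
text_bearing = [el for el in doc if isinstance(el, Text)]
```

Filtering with isinstance works at any level of the hierarchy: checking against Title selects only headings, while checking against Text selects every text-bearing element.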

Key design decisions:

  • Flat sequence: Elements form an ordered list, not a tree. Document structure is encoded through element types and metadata (parent_id, category_depth) rather than nesting.
  • Content-based IDs: Element IDs are SHA-256 hashes of element content, so the same content always produces the same ID.
  • Rich metadata: The ElementMetadata object carries 40+ optional fields covering source information, spatial coordinates, language detection, and format-specific data.
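
The flat-sequence decision can be illustrated by rebuilding an indented outline from parent_id links alone. FlatElement, its hand-written IDs, and render_outline are hypothetical names used only in this sketch; the field names parent_id and category_depth follow the metadata fields mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FlatElement:
    element_id: str                      # placeholder IDs, not real hashes
    category: str
    text: str
    parent_id: Optional[str] = None      # link to the enclosing heading, if any
    category_depth: Optional[int] = None # heading level for Title elements


def nesting_level(el: FlatElement, by_id: Dict[str, FlatElement]) -> int:
    """Count how many ancestors an element has by following parent_id links."""
    level = 0
    while el.parent_id is not None:
        el = by_id[el.parent_id]
        level += 1
    return level


def render_outline(elements: List[FlatElement]) -> List[str]:
    """Recover document structure from a flat, ordered list of elements."""
    by_id = {el.element_id: el for el in elements}
    return ["  " * nesting_level(el, by_id) + el.text for el in elements]


flat = [
    FlatElement("e1", "Title", "Guide", category_depth=0),
    FlatElement("e2", "Title", "Install", parent_id="e1", category_depth=1),
    FlatElement("e3", "NarrativeText", "Use pip.", parent_id="e2"),
    FlatElement("e4", "Title", "Usage", parent_id="e1", category_depth=1),
    FlatElement("e5", "NarrativeText", "Call partition.", parent_id="e4"),
]
outline = render_outline(flat)
```

The list itself stays flat and ordered, so consumers that ignore structure can iterate over it directly, while consumers that need the hierarchy can derive it from the metadata.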
