Overview
A concrete tool, provided by the Unstructured library, for constructing document element objects.
Description
The Element abstract base class and its subclasses (Text, NarrativeText, Title, Table, Image, etc.) define the core data model for document content. Element.__init__ initializes an element with a unique ID, optional spatial coordinates, metadata, and detection origin. Text.apply post-processes an element by passing its text content through one or more cleaner functions.
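The shape of this data model can be sketched with a few stand-in classes. This is a simplified illustration, not the library's code: `SketchElement`, `SketchText`, and `SketchTitle` are hypothetical names, and the coordinates, metadata, and detection-origin fields of the real classes are omitted.

```python
import abc
import uuid
from typing import Optional


class SketchElement(abc.ABC):
    """Hypothetical stand-in for an abstract element base class."""

    def __init__(self, element_id: Optional[str] = None):
        # Fall back to a freshly generated UUID when no ID is supplied.
        self.id = element_id if element_id is not None else str(uuid.uuid4())


class SketchText(SketchElement):
    """Hypothetical stand-in for a text-bearing element."""

    def __init__(self, text: str, element_id: Optional[str] = None):
        super().__init__(element_id=element_id)
        self.text = text

    def __str__(self) -> str:
        return self.text


class SketchTitle(SketchText):
    """Hypothetical stand-in for a more specific element category."""


heading = SketchTitle("Introduction")                    # ID generated automatically
labelled = SketchTitle("Methods", element_id="custom-id")  # ID supplied by the caller
```

Because each category is a subclass, code that consumes elements can distinguish them with `isinstance` while still treating them uniformly through the base class.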
Usage
Import Element subclasses when constructing elements manually (e.g., in custom partitioners or tests), when type-checking elements from partition output, or when applying text cleaning operations via Text.apply.
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/documents/elements.py
- Lines: 662-860
Signature
class Element(abc.ABC):
    def __init__(
        self,
        element_id: Optional[str] = None,
        coordinates: Optional[tuple[tuple[float, float], ...]] = None,
        coordinate_system: Optional[CoordinateSystem] = None,
        metadata: Optional[ElementMetadata] = None,
        detection_origin: Optional[str] = None,
    ):
        """Initialize a document element.

        Args:
            element_id: Unique identifier (auto-generated UUID if None).
            coordinates: Bounding box coordinates as tuple of (x, y) points.
            coordinate_system: Coordinate system for the bounding box.
            metadata: Rich metadata container (ElementMetadata).
            detection_origin: Origin of this element detection (e.g., model name).
        """

class Text(Element):
    def __init__(
        self,
        text: str,
        *args,
        **kwargs,
    ):
        """Initialize a text-bearing element.

        Args:
            text: The text content of this element.
        """

    def apply(self, *cleaners: Callable[[str], str]):
        """Apply cleaner functions to the element's text content.

        Args:
            cleaners: One or more functions that take a string and return a cleaned string.
        """
Import
from unstructured.documents.elements import (
    Element,
    Text,
    NarrativeText,
    Title,
    Table,
    Image,
    ListItem,
    Header,
    Footer,
    FigureCaption,
    CompositeElement,
    ElementMetadata,
)
I/O Contract
Inputs (Element.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| element_id | Optional[str] | No | Unique identifier (auto-generated if None) |
| coordinates | Optional[tuple[tuple[float, float], ...]] | No | Bounding box as tuple of (x, y) points |
| coordinate_system | Optional[CoordinateSystem] | No | Coordinate reference system |
| metadata | Optional[ElementMetadata] | No | Rich metadata container |
| detection_origin | Optional[str] | No | Source of element detection |
Inputs (Text.apply)
| Name | Type | Required | Description |
|---|---|---|---|
| cleaners | Callable[[str], str] | Yes | One or more text cleaning functions |
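Since `cleaners` is variadic, several functions can be passed in one call, and their order then matters. The sketch below illustrates this with a hypothetical helper, `apply_cleaners`, which assumes the cleaners run left to right; it is a stand-in for illustration, not the library's implementation, and the two cleaner functions are likewise invented for this example.

```python
from typing import Callable


def apply_cleaners(text: str, *cleaners: Callable[[str], str]) -> str:
    """Hypothetical helper: run each cleaner over the text in the order given."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text


def collapse_whitespace(s: str) -> str:
    # Trim the ends and squeeze internal runs of whitespace to one space.
    return " ".join(s.split())


def strip_trailing_punctuation(s: str) -> str:
    # Remove punctuation only at the very end of the string.
    return s.rstrip(".,;:!?")


raw = "  Hello,   world!  "

# Whitespace first: the "!" ends up last and is then stripped.
cleaned = apply_cleaners(raw, collapse_whitespace, strip_trailing_punctuation)

# Punctuation first: the string still ends in spaces, so "!" survives.
reordered = apply_cleaners(raw, strip_trailing_punctuation, collapse_whitespace)
```

Here `cleaned` is `"Hello, world"` while `reordered` is `"Hello, world!"`, so cleaners that depend on each other's output should be listed in the intended sequence.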
Outputs
| Name | Type | Description |
|---|---|---|
| Element instance | Element subclass | Constructed element with ID, metadata, and optional coordinates |
| apply (side effect) | None | Modifies element text in-place |
Usage Examples
Create Elements Manually
from unstructured.documents.elements import NarrativeText, Title, ElementMetadata
title = Title(
    text="Introduction",
    metadata=ElementMetadata(page_number=1, filename="report.pdf"),
)
paragraph = NarrativeText(
    text="This report summarizes the findings of our analysis.",
    metadata=ElementMetadata(page_number=1, filename="report.pdf"),
)
Apply Text Cleaners
from unstructured.documents.elements import NarrativeText
element = NarrativeText(text=" Extra whitespace here ")
# Apply a simple whitespace normalizer
element.apply(lambda s: " ".join(s.split()))
print(str(element))  # prints: Extra whitespace here
Type-Check Partition Output
from unstructured.partition.auto import partition
from unstructured.documents.elements import Title, Table
elements = partition(filename="report.pdf")
titles = [el for el in elements if isinstance(el, Title)]
tables = [el for el in elements if isinstance(el, Table)]
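When filtering by type, it helps to remember that `isinstance` also matches subclasses. The sketch below uses hypothetical stand-in classes (`FakeText`, `FakeTitle`, `FakeNarrativeText`, `FakeTable`) so it runs without parsing a document; the assumption that every category derives from a shared text base class belongs to this sketch and should be checked against the library before relying on it.

```python
from collections import Counter


class FakeText:
    """Hypothetical stand-in for a text-bearing base class."""

    def __init__(self, text: str):
        self.text = text


class FakeTitle(FakeText):
    pass


class FakeNarrativeText(FakeText):
    pass


class FakeTable(FakeText):
    pass


elements = [
    FakeTitle("Introduction"),
    FakeNarrativeText("This report summarizes the findings."),
    FakeTable("a | b"),
    FakeTitle("Results"),
]

# isinstance against a specific subclass selects only that category.
titles = [el for el in elements if isinstance(el, FakeTitle)]

# isinstance against the base class matches every subclass as well.
all_text = [el for el in elements if isinstance(el, FakeText)]

# The exact class name gives a per-category tally.
counts = Counter(type(el).__name__ for el in elements)
```

Checking against the most specific class, or comparing `type(el).__name__`, avoids accidentally selecting every element when the intent was a single category.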
Related Pages
Implements Principle