Implementation:Unstructured IO Unstructured Golden File Fixtures Embedding
| Knowledge Sources | |
|---|---|
| Domains | Testing, Embedding, Ingest |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
Golden file test fixtures containing expected JSON element output for embedding provider ingest pipelines (default, Bedrock, MixedBread AI, Vertex AI, Voyage AI).
Description
These JSON files represent the expected structured output from processing documents through embedding-enriched ingest pipelines. Each file contains a JSON array of element objects that include embedding vectors alongside standard element data. They serve as regression baselines for validating that changes to the embedding integration do not alter element extraction or the way embeddings are attached to elements.
The embedding providers covered include:
- Default (OpenAI) — Standard embedding pipeline (5,293 lines)
- AWS Bedrock — Amazon Bedrock embedding service (20,269 lines)
- MixedBread AI — MixedBread embedding API (13,587 lines)
- Vertex AI — Google Vertex AI embedding (10,285 lines)
- Voyage AI — Voyage AI embedding service (20,243 lines)
The large file sizes reflect the inclusion of an embedding vector (a high-dimensional float array) in each element, stored under the element's `embeddings` key.
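To illustrate why the vectors dominate the line counts, the sketch below compares the pretty-printed size of an element list with and without its `embeddings` field. It assumes the fixtures are serialized with one array item per line (as `json.dumps(..., indent=4)` does); the helper name `serialized_line_counts` and the sample element are hypothetical and not taken from the repository.

```python
import json


def serialized_line_counts(elements, indent=4):
    """Return (lines_with_embeddings, lines_without) for pretty-printed JSON.

    Assumption: the golden files are pretty-printed so that each float of an
    embedding vector occupies its own line.
    """
    with_vecs = json.dumps(elements, indent=indent).count("\n") + 1
    stripped = [
        {key: value for key, value in elem.items() if key != "embeddings"}
        for elem in elements
    ]
    without_vecs = json.dumps(stripped, indent=indent).count("\n") + 1
    return with_vecs, without_vecs


if __name__ == "__main__":
    # Synthetic element for illustration only; real vectors are far longer.
    sample = [{"type": "NarrativeText", "text": "x", "embeddings": [0.1, 0.2, 0.3]}]
    total, base = serialized_line_counts(sample)
    print(f"with embeddings: {total} lines, without: {base} lines")
```

Under that assumption, each element contributes roughly one line per vector component, so a provider with a larger embedding dimension yields a proportionally longer file.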
Usage
These fixtures are consumed by the embedding ingest test scripts (e.g., `test_unstructured_ingest/src/local-embed-bedrock.sh`), which use them as the baseline for checking that the full pipeline (partition, embed, serialize) produces consistent output.
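As a rough illustration of what a comparison against a golden file could look like in Python, the sketch below requires all non-embedding fields to match exactly and compares embedding vectors component-wise within a tolerance. This is an assumption for illustration only: the actual test scripts are shell scripts and may compare output differently. The function name `elements_match` and the tolerance values are hypothetical.

```python
import math


def elements_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if two element lists agree.

    Non-embedding fields must be equal. Embedding vectors are compared
    float by float with a tolerance (an assumed rule, not the repo's).
    """
    if len(expected) != len(actual):
        return False
    for exp, act in zip(expected, actual):
        # Compare everything except the embedding vector exactly.
        exp_rest = {k: v for k, v in exp.items() if k != "embeddings"}
        act_rest = {k: v for k, v in act.items() if k != "embeddings"}
        if exp_rest != act_rest:
            return False
        exp_vec = exp.get("embeddings")
        act_vec = act.get("embeddings")
        # Both elements must either carry a vector or lack one.
        if (exp_vec is None) != (act_vec is None):
            return False
        if exp_vec is not None:
            if len(exp_vec) != len(act_vec):
                return False
            if not all(
                math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
                for a, b in zip(exp_vec, act_vec)
            ):
                return False
    return True
```

Separating the exact comparison from the tolerant one keeps structural regressions (changed text, type, or metadata) distinguishable from small numerical drift in the vectors.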
Code Reference
Source Location
- Repository: Unstructured_IO_Unstructured
- File: test_unstructured_ingest/expected-structured-output/ (embed subdirectories)
Files Covered
| Provider | File | Lines |
|---|---|---|
| Default | embed/book-war-and-peace-1p.txt.json | 5293 |
| Bedrock | embed-bedrock/book-war-and-peace-1p.txt.json | 20269 |
| MixedBread AI | embed-mixedbreadai/book-war-and-peace-1p.txt.json | 13587 |
| Vertex AI | embed-vertexai/book-war-and-peace-1p.txt.json | 10285 |
| Voyage AI | embed-voyageai/book-war-and-peace-1p.txt.json | 20243 |
Signature
[
  {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Content from book-war-and-peace-1p.txt",
    "embeddings": [0.0123, -0.0456, 0.0789, ...],
    "metadata": {
      "languages": ["eng"],
      "filetype": "text/plain",
      "data_source": {
        "url": "path/to/book-war-and-peace-1p.txt"
      }
    }
  }
]
Import
# Not importable; the fixtures are plain JSON files consumed by test scripts.
import json

# Load one golden file as a list of element dicts.
with open("test_unstructured_ingest/expected-structured-output/embed/book-war-and-peace-1p.txt.json") as f:
    expected_elements = json.load(f)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| JSON file path | str | Yes | Path to a golden file under expected-structured-output/embed*/ |
Outputs
| Name | Type | Description |
|---|---|---|
| elements | List[Dict] | JSON array of element dicts with type, element_id, text, embeddings, and metadata |
| embeddings | List[float] | Embedding vector attached to each element (provider-specific dimensionality) |
| metadata.filetype | str | MIME type of the original document |
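The contract above can be checked mechanically. The following is a minimal sketch, assuming every element carries the fields shown in the Signature section; the helper `validate_element` and the `REQUIRED_FIELDS` mapping are hypothetical names introduced here, not part of the repository.

```python
from numbers import Real

# Field names and types taken from the Signature and Outputs table above.
REQUIRED_FIELDS = {"type": str, "element_id": str, "text": str, "metadata": dict}


def validate_element(elem):
    """Return a list of contract violations for one element (empty if none)."""
    problems = []
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(elem.get(key), expected_type):
            problems.append(f"{key}: expected {expected_type.__name__}")

    # Assumption: each element has a non-empty list of numbers as its vector.
    vec = elem.get("embeddings")
    if not isinstance(vec, list) or not vec:
        problems.append("embeddings: expected a non-empty list")
    elif not all(isinstance(x, Real) and not isinstance(x, bool) for x in vec):
        problems.append("embeddings: expected only numbers")

    # metadata.filetype is only checked when metadata itself is a dict.
    metadata = elem.get("metadata")
    if isinstance(metadata, dict) and not isinstance(metadata.get("filetype"), str):
        problems.append("metadata.filetype: expected str")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to see every deviation in a fixture at once.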
Usage Examples
Verifying Embedding Dimensions
import json

# Load a golden file and report the embedding dimension of the first
# element that carries a vector.
with open("test_unstructured_ingest/expected-structured-output/embed/book-war-and-peace-1p.txt.json") as f:
    elements = json.load(f)

for elem in elements:
    if "embeddings" in elem:
        dim = len(elem["embeddings"])
        print(f"Element '{elem['text'][:40]}...', embedding dim: {dim}")
        break
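The same check can be extended to all five provider fixtures. The sketch below builds the paths from the Files Covered table and collects the distinct embedding lengths per provider. The helper names (`embedding_dims`, `dims_by_provider`) are hypothetical, and the script assumes it is run from the repository root; fixtures that are not found are skipped rather than treated as errors.

```python
import json
from pathlib import Path

# Paths follow the Files Covered table; run from the repository root.
BASE = Path("test_unstructured_ingest/expected-structured-output")
FIXTURE = "book-war-and-peace-1p.txt.json"
PROVIDER_DIRS = {
    "Default": "embed",
    "Bedrock": "embed-bedrock",
    "MixedBread AI": "embed-mixedbreadai",
    "Vertex AI": "embed-vertexai",
    "Voyage AI": "embed-voyageai",
}


def embedding_dims(elements):
    """Return the set of distinct embedding lengths in a list of elements."""
    return {len(e["embeddings"]) for e in elements if "embeddings" in e}


def dims_by_provider(base=BASE):
    """Map each provider whose fixture exists to its sorted embedding lengths."""
    result = {}
    for provider, subdir in PROVIDER_DIRS.items():
        path = base / subdir / FIXTURE
        if not path.exists():
            continue  # Fixture not present in this checkout; skip it.
        with open(path, encoding="utf-8") as f:
            result[provider] = sorted(embedding_dims(json.load(f)))
    return result


if __name__ == "__main__":
    for provider, dims in dims_by_provider().items():
        print(f"{provider}: embedding dims = {dims}")
```

A provider reporting more than one length would suggest an inconsistent fixture, since a single embedding model is expected to produce vectors of one fixed dimension.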