Implementation:Unstructured IO Unstructured Golden File Fixtures Embedding

Knowledge Sources	Unstructured_IO_Unstructured
Domains	Testing, Embedding, Ingest
Last Updated	2026-02-12 09:30 GMT

Overview

Golden file test fixtures containing expected JSON element output for embedding provider ingest pipelines (default, Bedrock, MixedBread AI, Vertex AI, Voyage AI).

Description

These JSON files hold the expected structured output from processing documents through embedding-enriched ingest pipelines. Each file contains a JSON array of element objects that carry an embedding vector alongside the standard element data. They serve as regression baselines: output that diverges from a golden file signals a change in element extraction or in how embeddings are attached.

The embedding providers covered include:

Default (OpenAI) — Standard embedding pipeline (5,293 lines)
AWS Bedrock — Amazon Bedrock embedding service (20,269 lines)
MixedBread AI — MixedBread embedding API (13,587 lines)
Vertex AI — Google Vertex AI embedding (10,285 lines)
Voyage AI — Voyage AI embedding service (20,243 lines)

The large file sizes reflect the embedding vector (a high-dimensional float array) stored in each element's top-level embeddings field, as shown in the Signature section.

Usage

These fixtures are consumed by the embedding ingest test scripts (e.g., `test_unstructured_ingest/src/local-embed-bedrock.sh`). They verify that the full pipeline — partition, embed, serialize — produces consistent output.
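The comparison step can be sketched as follows. This is a minimal illustration, not the method the test scripts actually use: the helper names `embeddings_close` and `diff_elements` are hypothetical, and the choice to compare embeddings with a floating-point tolerance rather than exact equality is an assumption.

```python
import math


def embeddings_close(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Return True when two vectors have equal length and near-equal values."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


def diff_elements(expected, actual):
    """Return a list of human-readable mismatches between two element lists."""
    problems = []
    if len(expected) != len(actual):
        problems.append(f"element count: expected {len(expected)}, got {len(actual)}")
    for i, (exp, act) in enumerate(zip(expected, actual)):
        # Scalar fields must match exactly.
        for key in ("type", "element_id", "text"):
            if exp.get(key) != act.get(key):
                problems.append(f"element {i}: {key} differs")
        # Embeddings are compared with a tolerance (an assumption of this sketch).
        if not embeddings_close(exp.get("embeddings", []), act.get("embeddings", [])):
            problems.append(f"element {i}: embeddings differ")
    return problems


# Tiny in-memory stand-ins for a golden file and a fresh pipeline run.
golden = [{"type": "NarrativeText", "element_id": "abc", "text": "hi",
           "embeddings": [0.1, 0.2]}]
fresh = [{"type": "NarrativeText", "element_id": "abc", "text": "hi",
          "embeddings": [0.1, 0.2000000001]}]

# An empty list means the fresh run matches the baseline.
print(diff_elements(golden, fresh))
```

In practice both lists would come from `json.load` on a golden file and on the freshly generated output. Whether a tolerance is appropriate depends on how deterministic the embedding provider is.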

Code Reference

Source Location

Repository: Unstructured_IO_Unstructured
File: test_unstructured_ingest/expected-structured-output/ (the embed* subdirectories listed under Files Covered)

Files Covered

Provider	File	Lines
Default	embed/book-war-and-peace-1p.txt.json	5293
Bedrock	embed-bedrock/book-war-and-peace-1p.txt.json	20269
MixedBread AI	embed-mixedbreadai/book-war-and-peace-1p.txt.json	13587
Vertex AI	embed-vertexai/book-war-and-peace-1p.txt.json	10285
Voyage AI	embed-voyageai/book-war-and-peace-1p.txt.json	20243
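To inspect the fixtures side by side, a short script along these lines could walk the subdirectories in the table and report the element count and embedding dimension of each file. The function name `summarize_fixtures` is hypothetical, and the sketch assumes it runs from the repository root; missing files are reported as None rather than raising.

```python
import json
from pathlib import Path

# Subdirectory and file names are taken from the Files Covered table.
PROVIDER_DIRS = [
    "embed",
    "embed-bedrock",
    "embed-mixedbreadai",
    "embed-vertexai",
    "embed-voyageai",
]
FIXTURE_NAME = "book-war-and-peace-1p.txt.json"


def summarize_fixtures(root):
    """Map each provider directory to (element count, sorted embedding dims)."""
    summary = {}
    for name in PROVIDER_DIRS:
        path = Path(root) / name / FIXTURE_NAME
        if not path.is_file():
            summary[name] = None  # fixture not present in this checkout
            continue
        elements = json.loads(path.read_text(encoding="utf-8"))
        # Collect every distinct vector length found in the file.
        dims = {len(e["embeddings"]) for e in elements if "embeddings" in e}
        summary[name] = (len(elements), sorted(dims))
    return summary


if __name__ == "__main__":
    root = "test_unstructured_ingest/expected-structured-output"
    for provider, info in summarize_fixtures(root).items():
        print(provider, info)
```

A file that reports more than one dimension would suggest inconsistent embedding output within a single provider.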

Signature

[
  {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Content from book-war-and-peace-1p.txt",
    "embeddings": [0.0123, -0.0456, 0.0789, ...],
    "metadata": {
      "languages": ["eng"],
      "filetype": "text/plain",
      "data_source": {
        "url": "path/to/book-war-and-peace-1p.txt"
      }
    }
  }
]

Import

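# Not a Python module: the fixtures are plain JSON consumed by test scripts.
# They can still be loaded for inspection with the standard json module.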
import json
with open("test_unstructured_ingest/expected-structured-output/embed/book-war-and-peace-1p.txt.json") as f:
    expected_elements = json.load(f)

I/O Contract

Inputs

Name	Type	Required	Description
JSON file path	str	Yes	Path to a golden file under expected-structured-output/embed*/

Outputs

Name	Type	Description
elements	List[Dict]	JSON array of element dicts with type, element_id, text, embeddings, and metadata
embeddings	List[float]	Embedding vector attached to each element (provider-specific dimensionality)
metadata.filetype	str	MIME type of the original document
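The contract above can be checked mechanically. The sketch below assumes every element carries all five keys named in the Outputs table; if some fixtures contain elements without embeddings, the required-key list would need relaxing. `validate_element` and `REQUIRED_KEYS` are hypothetical names introduced for illustration.

```python
# Keys named in the Outputs table; treating all as required is an assumption.
REQUIRED_KEYS = ("type", "element_id", "text", "embeddings", "metadata")


def is_number(value):
    """True for int or float, excluding bool (a subclass of int in Python)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_element(elem):
    """Return a list of contract violations for one element dict."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in elem]

    emb = elem.get("embeddings")
    if emb is not None:
        if not isinstance(emb, list) or not all(is_number(x) for x in emb):
            errors.append("embeddings is not a list of numbers")

    meta = elem.get("metadata")
    if isinstance(meta, dict) and not isinstance(meta.get("filetype"), str):
        errors.append("metadata.filetype is not a string")
    return errors


# Example element shaped like the Signature section.
sample = {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Content from book-war-and-peace-1p.txt",
    "embeddings": [0.0123, -0.0456, 0.0789],
    "metadata": {"languages": ["eng"], "filetype": "text/plain"},
}
print(validate_element(sample))
```

Running `validate_element` over every element of a loaded golden file gives a quick sanity check before a fixture is regenerated or committed.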

Usage Examples

Verifying Embedding Dimensions

import json

# Load golden file and check embedding dimensions
with open("test_unstructured_ingest/expected-structured-output/embed/book-war-and-peace-1p.txt.json") as f:
    elements = json.load(f)

for elem in elements:
    if "embeddings" in elem:
        dim = len(elem["embeddings"])
        print(f"Element '{elem['text'][:40]}...' — embedding dim: {dim}")
        break
