Implementation:Unstructured IO Unstructured Golden File Fixtures Embedding

From Leeroopedia
Knowledge Sources
Domains Testing, Embedding, Ingest
Last Updated 2026-02-12 09:30 GMT

Overview

Golden file test fixtures containing expected JSON element output for embedding provider ingest pipelines (default, Bedrock, MixedBread AI, Vertex AI, Voyage AI).

Description

These JSON files hold the expected structured output from processing documents through embedding-enriched ingest pipelines. Each file contains a JSON array of element objects that carry an embedding vector alongside the standard element data. They serve as regression baselines: a pipeline run whose output differs from the stored file signals that element extraction or embedding attachment has changed.

The embedding providers covered include:

  • Default (OpenAI) — Standard embedding pipeline (5,293 lines)
  • AWS Bedrock — Amazon Bedrock embedding service (20,269 lines)
  • MixedBread AI — MixedBread embedding API (13,587 lines)
  • Vertex AI — Google Vertex AI embedding (10,285 lines)
  • Voyage AI — Voyage AI embedding service (20,243 lines)

The large file sizes reflect the embedding vectors (high-dimensional float arrays) stored in each element's embeddings field. The line counts differ between providers because vector dimensionality is provider-specific.

Usage

These fixtures are consumed by the embedding ingest test scripts (e.g., `test_unstructured_ingest/src/local-embed-bedrock.sh`). They verify that the full pipeline (partition, embed, serialize) produces output consistent with the stored baseline.
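To illustrate what a baseline comparison could look like, here is a minimal sketch in Python. It is not the comparison the test scripts perform, since this page does not describe that step. The function name `elements_match` and the tolerance values are hypothetical, and comparing floats with a tolerance rather than exactly is an assumption made for the example.

```python
import math


def elements_match(actual, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if two element lists agree on type, text, and embeddings.

    Hypothetical helper: the tolerance-based float comparison is an
    assumption, not the documented behavior of the ingest test scripts.
    """
    if len(actual) != len(expected):
        return False
    for got, want in zip(actual, expected):
        # Structural fields are compared exactly.
        if got.get("type") != want.get("type"):
            return False
        if got.get("text") != want.get("text"):
            return False
        got_vec = got.get("embeddings", [])
        want_vec = want.get("embeddings", [])
        if len(got_vec) != len(want_vec):
            return False
        # Embedding components are compared within a tolerance.
        for g, w in zip(got_vec, want_vec):
            if not math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True
```

A caller would load the golden file with `json.load`, produce `actual` from a fresh pipeline run, and fail the test when `elements_match(actual, expected)` returns `False`.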

Code Reference

Source Location

Files Covered

Provider        File                                                   Lines
Default         embed/book-war-and-peace-1p.txt.json                   5,293
Bedrock         embed-bedrock/book-war-and-peace-1p.txt.json           20,269
MixedBread AI   embed-mixedbreadai/book-war-and-peace-1p.txt.json      13,587
Vertex AI       embed-vertexai/book-war-and-peace-1p.txt.json          10,285
Voyage AI       embed-voyageai/book-war-and-peace-1p.txt.json          20,243
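The table above can be turned into a small path map, which is handy for checking that all five fixtures are present in a checkout. The sketch below is illustrative only. The directory names come from the table and the root directory from the path used in the Import section of this page; the names `golden_file_paths` and `missing_fixtures` are hypothetical.

```python
from pathlib import Path

# Root taken from the path shown in this page's Import section.
GOLDEN_ROOT = Path("test_unstructured_ingest/expected-structured-output")

# Provider -> subdirectory, as listed in the Files Covered table.
PROVIDER_DIRS = {
    "Default": "embed",
    "Bedrock": "embed-bedrock",
    "MixedBread AI": "embed-mixedbreadai",
    "Vertex AI": "embed-vertexai",
    "Voyage AI": "embed-voyageai",
}

FIXTURE_NAME = "book-war-and-peace-1p.txt.json"


def golden_file_paths(root=GOLDEN_ROOT):
    """Map each provider name to the expected path of its golden file."""
    return {
        provider: Path(root) / subdir / FIXTURE_NAME
        for provider, subdir in PROVIDER_DIRS.items()
    }


def missing_fixtures(root=GOLDEN_ROOT):
    """List the providers whose golden file does not exist under root."""
    return [
        provider
        for provider, path in golden_file_paths(root).items()
        if not path.is_file()
    ]
```

Running `missing_fixtures()` from the repository root should return an empty list when every fixture in the table is checked out.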

Signature

[
  {
    "type": "NarrativeText",
    "element_id": "hex_string",
    "text": "Content from book-war-and-peace-1p.txt",
    "embeddings": [0.0123, -0.0456, 0.0789, ...],
    "metadata": {
      "languages": ["eng"],
      "filetype": "text/plain",
      "data_source": {
        "url": "path/to/book-war-and-peace-1p.txt"
      }
    }
  }
]
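A structural check against the shape shown above might look like the following sketch. It assumes every element carries all five top-level keys from the signature; real fixtures may omit some of them (the usage example on this page guards with `if "embeddings" in elem`), so treat the required-key list as an assumption. The names `REQUIRED_KEYS` and `schema_problems` are hypothetical.

```python
# Key -> expected Python type after json.load, based on the signature above.
# Treating all five keys as required is an assumption for this sketch.
REQUIRED_KEYS = {
    "type": str,
    "element_id": str,
    "text": str,
    "embeddings": list,
    "metadata": dict,
}


def schema_problems(element):
    """Return a list of human-readable problems found in one element dict."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in element:
            problems.append(f"missing key: {key}")
        elif not isinstance(element[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    vec = element.get("embeddings")
    if isinstance(vec, list):
        # bool is a subclass of int in Python, so exclude it explicitly.
        numeric = all(
            isinstance(x, (int, float)) and not isinstance(x, bool) for x in vec
        )
        if not numeric:
            problems.append("embeddings should contain only numbers")
    return problems
```

An empty return value means the element matches the assumed shape; any other result lists what deviates.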

Import

# The fixtures are data files, not an importable module; read them as JSON.
import json

with open("test_unstructured_ingest/expected-structured-output/embed/book-war-and-peace-1p.txt.json", encoding="utf-8") as f:
    expected_elements = json.load(f)

I/O Contract

Inputs

Name             Type   Required   Description
JSON file path   str    Yes        Path to a golden file under expected-structured-output/embed*/

Outputs

Name                Type          Description
elements            List[Dict]    JSON array of element dicts with type, element_id, text, embeddings, and metadata
embeddings          List[float]   Embedding vector attached to each element (provider-specific dimensionality)
metadata.filetype   str           MIME type of the original document

Usage Examples

Verifying Embedding Dimensions

import json

# Load the golden file and report the embedding dimension of the first
# element that carries an embedding vector.
with open("test_unstructured_ingest/expected-structured-output/embed/book-war-and-peace-1p.txt.json", encoding="utf-8") as f:
    elements = json.load(f)

for elem in elements:
    if "embeddings" in elem:
        dim = len(elem["embeddings"])
        print(f"Element '{elem['text'][:40]}...' has embedding dim: {dim}")
        break
