Implementation:Unstructured IO Unstructured OpenAIEmbeddingEncoder Embed Documents

From Leeroopedia
Revision as of 11:54, 16 February 2026 by Admin (auto-imported from implementations/Unstructured_IO_Unstructured_OpenAIEmbeddingEncoder_Embed_Documents.md)
Knowledge Sources
Domains NLP, RAG, Embeddings
Last Updated 2026-02-12 00:00 GMT

Overview

Concrete tool for generating OpenAI vector embeddings for document elements, wrapping the OpenAI API via LangChain.

Description

OpenAIEmbeddingEncoder implements BaseEmbeddingEncoder on top of OpenAI's embedding API. It defaults to the text-embedding-ada-002 model, which can be changed through model_name. OpenAIEmbeddingConfig holds the API key and model name, and the actual API calls are delegated to LangChain's OpenAIEmbeddings client. The embed_documents method converts each element to a string with str(), embeds the text, stores the resulting vector in element.embeddings, and returns the same elements.
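The convert, embed, and store flow described above can be illustrated with a minimal, self-contained sketch. FakeElement, FakeClient, and embed_elements are hypothetical stand-ins introduced only for illustration (they are not part of unstructured or LangChain), so the example runs without an API key. This is not the library's actual implementation; it only assumes that the client exposes an embed_documents method that takes a list of strings and returns one vector per string, as LangChain's OpenAIEmbeddings does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured Element."""
    text: str
    embeddings: Optional[List[float]] = None

    def __str__(self) -> str:
        return self.text


class FakeClient:
    """Hypothetical stand-in for an embeddings client; returns toy 3-d vectors."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Toy vector: [character count, word count, 1.0].
        # A real client would call the embedding API here.
        return [[float(len(t)), float(len(t.split())), 1.0] for t in texts]


def embed_elements(client: FakeClient, elements: List[FakeElement]) -> List[FakeElement]:
    # 1. Convert each element to its string form.
    texts = [str(el) for el in elements]
    # 2. Embed all texts in a single client call.
    vectors = client.embed_documents(texts)
    # 3. Store each vector on its element and return the same objects.
    for el, vec in zip(elements, vectors):
        el.embeddings = vec
    return elements


elements = [FakeElement("Quarterly revenue rose."), FakeElement("Title")]
embedded = embed_elements(FakeClient(), elements)
```

Note that the sketch mutates the input elements in place and returns them, which matches the "same elements" contract stated on this page.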

Usage

Import this class when you need to add OpenAI embeddings to document elements. It is used in the ingest pipeline via the --embedding-provider openai flag, and it can also be called programmatically in custom pipelines.

Code Reference

Source Location

  • Repository: unstructured
  • File: unstructured/embed/openai.py
  • Lines: 17-68

Signature

class OpenAIEmbeddingConfig(EmbeddingConfig):
    api_key: SecretStr
    model_name: str = Field(default="text-embedding-ada-002")

    @requires_dependencies(["langchain_openai"], extras="openai")
    def get_client(self) -> "OpenAIEmbeddings":
        """Create LangChain OpenAI embeddings client."""

@dataclass
class OpenAIEmbeddingEncoder(BaseEmbeddingEncoder):
    config: OpenAIEmbeddingConfig

    def initialize(self):
        """Initialize the OpenAI client via config.get_client()."""

    @property
    def num_of_dimensions(self) -> Tuple[int]:
        """Returns (1536,) for ada-002."""

    @property
    def is_unit_vector(self) -> bool:
        """Returns True (OpenAI embeddings are L2-normalized)."""

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        """Embed elements using OpenAI API, storing vectors in element.embeddings."""

    def embed_query(self, query: str) -> List[float]:
        """Embed a single query string."""
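The is_unit_vector property matters because, for L2-normalized vectors, cosine similarity reduces to a plain dot product. The sketch below shows one way such a check could be written and demonstrates the equivalence on toy vectors. The helper names (l2_norm, looks_like_unit_vector, normalize, dot, cosine) are hypothetical, and the code does not reproduce how the library computes the property.

```python
import math
from typing import List


def l2_norm(vec: List[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def looks_like_unit_vector(vec: List[float], tol: float = 1e-6) -> bool:
    """True when the vector's L2 norm is within tol of 1.0."""
    return math.isclose(l2_norm(vec), 1.0, abs_tol=tol)


def normalize(vec: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = l2_norm(vec)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


# Toy 2-d vectors standing in for real embeddings.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors the two similarity measures agree.
same = math.isclose(dot(a, b), cosine(a, b))
```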

Import

from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

I/O Contract

Inputs (Configuration)

Name | Type | Required | Description
api_key | SecretStr | Yes | OpenAI API key
model_name | str | No | Embedding model name (default "text-embedding-ada-002")

Inputs (embed_documents)

Name | Type | Required | Description
elements | List[Element] | Yes | Elements to embed (converted to strings via str())

Outputs

Name | Type | Description
return | List[Element] | Same elements with element.embeddings populated as list[float] (1536 dimensions for ada-002)
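A caller may want to confirm the output contract before writing vectors to a store. The following is a minimal sketch of such a check, operating on plain lists so that it runs without the library. find_contract_violations is a hypothetical helper, not part of unstructured, and the toy data uses 3 dimensions rather than 1536.

```python
from typing import List, Optional, Sequence


def find_contract_violations(
    embeddings: Sequence[Optional[Sequence[float]]], expected_dim: int
) -> List[int]:
    """Return the indices of entries that are missing or have the wrong length."""
    bad = []
    for i, vec in enumerate(embeddings):
        if vec is None or len(vec) != expected_dim:
            bad.append(i)
    return bad


# Toy data: one valid vector, one missing, one with the wrong dimensionality.
sample = [[0.1, 0.2, 0.3], None, [0.1, 0.2]]
violations = find_contract_violations(sample, expected_dim=3)
```

With real output, the input would be something like [el.embeddings for el in embedded], with expected_dim=1536 for the default ada-002 model.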

Usage Examples

Embed Document Elements

from unstructured.partition.auto import partition
from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

# 1. Partition document
elements = partition(filename="report.pdf")

# 2. Configure OpenAI embedding
config = OpenAIEmbeddingConfig(api_key="sk-your-key-here")
encoder = OpenAIEmbeddingEncoder(config=config)
encoder.initialize()

# 3. Embed elements
embedded = encoder.embed_documents(elements)

# 4. Access embeddings
for el in embedded:
    if el.embeddings:
        print(f"{type(el).__name__}: {len(el.embeddings)} dimensions")

Embed a Query for Similarity Search

from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder

config = OpenAIEmbeddingConfig(api_key="sk-your-key-here")
encoder = OpenAIEmbeddingEncoder(config=config)
encoder.initialize()

query_vector = encoder.embed_query("What are the financial results?")
print(f"Query vector dimensions: {len(query_vector)}")  # 1536 with the default text-embedding-ada-002 model
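Once a query vector is available, similarity search amounts to scoring it against the stored element vectors. Below is a minimal sketch using toy 2-d unit vectors in place of real embeddings; top_k and dot are hypothetical helpers, and the texts and vectors are made up for illustration. It assumes the vectors are unit length, so the dot product serves as cosine similarity.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vector: Sequence[float],
    candidates: Sequence[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score each (text, vector) pair against the query and keep the k best."""
    scored = [(text, dot(query_vector, vec)) for text, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy unit vectors standing in for element.embeddings values.
candidates = [
    ("Revenue grew 12%.", [0.8, 0.6]),
    ("Office locations", [0.0, 1.0]),
    ("Net income summary", [0.6, 0.8]),
]
query_vector = [1.0, 0.0]

results = top_k(query_vector, candidates, k=2)
```

In a real pipeline the candidates would come from pairs such as (str(el), el.embeddings) for each embedded element, and a vector database would usually replace the linear scan.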

Related Pages

Implements Principle

Requires Environment
