Environment:Unstructured IO Unstructured OpenAI API

From Leeroopedia
Revision as of 18:46, 16 February 2026 by Admin (auto-imported from environments/Unstructured_IO_Unstructured_OpenAI_API.md)
Knowledge Sources
Domains: Embeddings
Last Updated: 2026-02-12 09:00 GMT

Overview

The OpenAI_API environment provides the dependencies and configuration needed to generate document embeddings using the OpenAI API via the langchain_openai integration.

Description

The OpenAI embedding encoder in unstructured uses the langchain_openai package as its client interface to the OpenAI API. The openai.py module decorates its get_client() method with @requires_dependencies(["langchain_openai"], extras="openai"), enforcing that the correct extra is installed before any API calls are attempted. The import of langchain_openai is performed lazily inside the get_client() method body, meaning the dependency is only loaded at runtime when embedding is actually requested.
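The combination of a dependency-checking decorator and a lazy import can be sketched with the standard library alone. This is a minimal illustration, not the unstructured implementation: requires_packages, load_json_module, and needs_missing_package are hypothetical names, and the error text is only modeled on the message shown on this page.

```python
import functools
import importlib.util


def requires_packages(packages, extras=None):
    """Hypothetical stand-in for a dependency-checking decorator."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Availability is checked at call time, not when the module is imported.
            missing = [p for p in packages if importlib.util.find_spec(p) is None]
            if missing:
                hint = (
                    f' Install with: pip install "unstructured[{extras}]"'
                    if extras
                    else ""
                )
                raise ImportError(
                    f"Missing dependencies: {', '.join(missing)}.{hint}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_packages(["json"])
def load_json_module():
    # Lazy import: resolved only when the function is actually called.
    import json

    return json


@requires_packages(["package_that_does_not_exist_xyz"], extras="openai")
def needs_missing_package():
    # Never reached: the decorator raises before the function body runs.
    import package_that_does_not_exist_xyz

    return package_that_does_not_exist_xyz
```

Because the check runs inside the wrapper, importing a module that defines needs_missing_package succeeds even when the package is absent; the ImportError only surfaces when the function is called.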

The default embedding model is text-embedding-ada-002. The API key is held in a Pydantic SecretStr field, which is typically populated from the OPENAI_API_KEY environment variable. SecretStr masks the value in its str() and repr() output, so the key does not appear in plain text in logs or serialized output unless get_secret_value() is called explicitly.
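The key-handling pattern described here can be illustrated without any third-party packages. The sketch below is an assumption-laden stand-in, not Pydantic's code: MaskedSecret and load_api_key are hypothetical names that only mimic the masking behavior and the get_secret_value() accessor of SecretStr.

```python
import os


class MaskedSecret:
    """Hypothetical stand-in that mimics how SecretStr hides its value."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        # The raw key is only returned through this explicit call.
        return self._value

    def __str__(self) -> str:
        return "**********"

    def __repr__(self) -> str:
        return "MaskedSecret('**********')"


def load_api_key(env=None, var_name="OPENAI_API_KEY") -> MaskedSecret:
    """Read the key from a mapping (os.environ by default) and wrap it."""
    env = os.environ if env is None else env
    raw = env.get(var_name, "").strip()
    if not raw:
        # Fail early with a clear message instead of at the first API call.
        raise ValueError(f"{var_name} is not set")
    return MaskedSecret(raw)
```

Printing or logging the returned object shows only asterisks; code that needs the real key has to call get_secret_value() deliberately.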

Usage

This environment is required when using the OpenAIEmbeddingEncoder to generate vector embeddings for document elements. The encoder is typically used in retrieval-augmented generation (RAG) pipelines, where partitioned document elements are embedded so that they can be retrieved by semantic search.
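To make the semantic search step concrete, here is a small sketch of ranking embedded elements against a query vector by cosine similarity. It is illustrative only: the three-dimensional vectors are toy values standing in for real embeddings, and cosine_similarity and rank_by_similarity are hypothetical helpers, not part of unstructured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, embedded_elements):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in embedded_elements
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings of partitioned document elements.
elements = [
    ("Invoice total: $120", [0.9, 0.1, 0.0]),
    ("Shipping address", [0.0, 1.0, 0.0]),
    ("Payment due date", [0.7, 0.0, 0.7]),
]
ranked = rank_by_similarity([1.0, 0.0, 0.0], elements)
```

In a real pipeline the vectors would come from the encoder and the ranking would usually be delegated to a vector store; the arithmetic is the same idea at small scale.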

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.11, < 3.14 | Required Python version range |
| OS | Any | No OS-specific requirements; API calls are network-based |
| Network | Internet access required | Must be able to reach the OpenAI API endpoint |

Dependencies

System Packages

  • No system packages required beyond Python itself

Python Packages

  • langchain_openai -- LangChain wrapper for OpenAI API (installed via the openai extra)
  • openai -- underlying OpenAI Python client (transitive dependency of langchain_openai)
  • pydantic -- data validation with SecretStr for secure API key handling (transitive dependency)

Credentials

  • OPENAI_API_KEY -- OpenAI API key (required; passed as Pydantic SecretStr to prevent accidental exposure in logs)

Quick Install

# Install unstructured with OpenAI extras
pip install "unstructured[openai]"

# Set the API key environment variable
export OPENAI_API_KEY="sk-..."

Code Evidence

Dependency requirement decorator (openai.py):

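# Raises a helpful ImportError if the openai extra is not installed
@requires_dependencies(["langchain_openai"], extras="openai")
def get_client(self):
    # Lazy import: langchain_openai is loaded only when a client is requested
    from langchain_openai import OpenAIEmbeddings
    return OpenAIEmbeddings(
        model=self.model_name,
        openai_api_key=self.api_key,
    )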

Default model configuration (openai.py):

model_name: str = "text-embedding-ada-002"
api_key: SecretStr

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| ImportError: langchain_openai is required. Install with: pip install "unstructured[openai]" | The openai extra is not installed | Install via pip install "unstructured[openai]" |
| AuthenticationError: Incorrect API key provided | Invalid or expired OPENAI_API_KEY | Verify the API key is correct and active in your OpenAI dashboard |
| RateLimitError: Rate limit reached | Too many API requests in a short period | Implement retry logic with exponential backoff, or reduce batch size |
| ValidationError: api_key field required | No api_key reached the encoder configuration, typically because OPENAI_API_KEY is not set | Export the variable (export OPENAI_API_KEY="sk-...") and make sure its value is passed as api_key |
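The retry advice for rate-limit errors can be sketched as a small wrapper. This is a generic illustration under stated assumptions: TransientError is a hypothetical stand-in for whatever rate-limit exception the client raises, and call_with_backoff and its parameters are illustrative names rather than an API of unstructured, langchain_openai, or openai.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit error raised by an API client."""


def call_with_backoff(
    func,
    max_attempts=5,
    base_delay=1.0,
    max_delay=30.0,
    retry_on=(TransientError,),
    sleep=time.sleep,
):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The sleep function is injectable so the wrapper can be exercised in tests without waiting. In practice retry_on would be set to the client's actual rate-limit exception class, and func would be a closure around the embedding call.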

Compatibility Notes

  • The langchain_openai package is used instead of the raw openai package to maintain consistency with the LangChain ecosystem
  • The lazy import pattern in get_client() means the dependency is only needed at runtime, not at module import time
  • SecretStr from Pydantic ensures the API key is masked in string representations, logs, and serialized output
  • The default model text-embedding-ada-002 can be overridden by passing a different model name during encoder initialization
