Overview
A concrete tool that generates OpenAI vector embeddings for document elements by wrapping the OpenAI API through LangChain.
Description
The OpenAIEmbeddingEncoder implements BaseEmbeddingEncoder and uses OpenAI's text-embedding-ada-002 model by default; the model is configurable through model_name. OpenAIEmbeddingConfig holds the API key and model name, and the encoder delegates the actual API calls to LangChain's OpenAIEmbeddings client. The embed_documents method converts each element to a string, embeds the text, and stores the resulting vector in element.embeddings.
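The three steps described for embed_documents (convert to string, embed, store the vector) can be sketched without any network access. This is a minimal illustration, not the library's implementation: FakeElement, FakeClient, and embed_elements are hypothetical stand-ins introduced here, and the fake client returns a tiny deterministic vector instead of a real embedding.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured Element."""
    text: str
    embeddings: Optional[List[float]] = None

    def __str__(self) -> str:
        return self.text


class FakeClient:
    """Hypothetical stand-in for the embeddings client; makes no API calls."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Return a small deterministic vector per text instead of a real embedding.
        return [[float(len(t)), 1.0] for t in texts]


def embed_elements(client: FakeClient, elements: List[FakeElement]) -> List[FakeElement]:
    # 1. Convert each element to a string.
    texts = [str(e) for e in elements]
    # 2. Embed all texts in a single client call.
    vectors = client.embed_documents(texts)
    # 3. Store each vector on its corresponding element.
    for element, vector in zip(elements, vectors):
        element.embeddings = vector
    return elements


elements = embed_elements(FakeClient(), [FakeElement("hello"), FakeElement("hi")])
print([e.embeddings for e in elements])  # [[5.0, 1.0], [2.0, 1.0]]
```

The same list is returned with its elements modified in place, which matches the "same elements with element.embeddings populated" behavior described for the real encoder.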
Usage
Import this class when you need to add OpenAI embeddings to document elements. It is used in the ingest pipeline via the --embedding-provider openai flag, and can also be called programmatically in custom pipelines.
Code Reference
Source Location
- Repository: unstructured
- File: unstructured/embed/openai.py
- Lines: 17-68
Signature
class OpenAIEmbeddingConfig(EmbeddingConfig):
    api_key: SecretStr
    model_name: str = Field(default="text-embedding-ada-002")

    @requires_dependencies(["langchain_openai"], extras="openai")
    def get_client(self) -> "OpenAIEmbeddings":
        """Create LangChain OpenAI embeddings client."""

@dataclass
class OpenAIEmbeddingEncoder(BaseEmbeddingEncoder):
    config: OpenAIEmbeddingConfig

    def initialize(self):
        """Initialize the OpenAI client via config.get_client()."""

    @property
    def num_of_dimensions(self) -> Tuple[int]:
        """Returns (1536,) for ada-002."""

    @property
    def is_unit_vector(self) -> bool:
        """Returns True (OpenAI embeddings are L2-normalized)."""

    def embed_documents(self, elements: List[Element]) -> List[Element]:
        """Embed elements using OpenAI API, storing vectors in element.embeddings."""

    def embed_query(self, query: str) -> List[float]:
        """Embed a single query string."""
Import
from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder
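The is_unit_vector property in the signature reports that OpenAI embeddings are L2-normalized. As a rough illustration of what that means, the sketch below checks whether a vector's Euclidean norm is 1 within a tolerance. The helper names l2_norm and has_unit_norm are hypothetical and not part of unstructured, and the tolerance is an assumed value.

```python
import math
from typing import List


def l2_norm(vector: List[float]) -> float:
    # Euclidean length: square root of the sum of squared components.
    return math.sqrt(sum(x * x for x in vector))


def has_unit_norm(vector: List[float], tol: float = 1e-6) -> bool:
    # A vector is L2-normalized when its length is 1, up to rounding error.
    return math.isclose(l2_norm(vector), 1.0, abs_tol=tol)


print(has_unit_norm([0.6, 0.8]))  # True: 0.36 + 0.64 = 1
print(has_unit_norm([1.0, 1.0]))  # False: the norm is sqrt(2)
```

A check like this could be run on a returned embedding, for example on el.embeddings, before relying on dot products as cosine similarities.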
I/O Contract
Inputs (Configuration)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| api_key | SecretStr | Yes | OpenAI API key |
| model_name | str | No | Embedding model name (default "text-embedding-ada-002") |
Inputs (embed_documents)
| Name | Type | Required | Description |
|------|------|----------|-------------|
| elements | List[Element] | Yes | Elements to embed (converted to strings via str()) |
Outputs
| Name | Type | Description |
|------|------|-------------|
| return | List[Element] | Same elements with element.embeddings populated as list[float] (1536 dimensions for ada-002) |
Usage Examples
Embed Document Elements
from unstructured.partition.auto import partition
from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder
# 1. Partition document
elements = partition(filename="report.pdf")
# 2. Configure OpenAI embedding
config = OpenAIEmbeddingConfig(api_key="sk-your-key-here")
encoder = OpenAIEmbeddingEncoder(config=config)
encoder.initialize()
# 3. Embed elements
embedded = encoder.embed_documents(elements)
# 4. Access embeddings
for el in embedded:
    if el.embeddings:
        print(f"{type(el).__name__}: {len(el.embeddings)} dimensions")
Embed a Query for Similarity Search
from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder
config = OpenAIEmbeddingConfig(api_key="sk-your-key-here")
encoder = OpenAIEmbeddingEncoder(config=config)
encoder.initialize()
query_vector = encoder.embed_query("What are the financial results?")
print(f"Query vector dimensions: {len(query_vector)}") # 1536
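Once a query vector is available, one common way to use it for similarity search is to score it against stored element vectors. The sketch below is an assumption about how a caller might do this, not something the encoder provides: dot and rank_by_similarity are hypothetical helpers, and the toy 2-dimensional vectors stand in for real 1536-dimensional embeddings. Because the vectors are unit length, the dot product equals cosine similarity.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product; equals cosine similarity when both vectors have unit length.
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query_vector: Sequence[float],
    candidates: List[Tuple[str, List[float]]],
) -> List[Tuple[str, float]]:
    # Score every candidate against the query, then sort best match first.
    scored = [(label, dot(query_vector, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy unit vectors standing in for real embeddings (illustrative data only).
query = [1.0, 0.0]
candidates = [
    ("revenue", [0.8, 0.6]),
    ("weather", [0.0, 1.0]),
    ("earnings", [0.6, 0.8]),
]

ranked = rank_by_similarity(query, candidates)
print([label for label, _ in ranked])  # ['revenue', 'earnings', 'weather']
```

In a real pipeline the candidates would be (label, el.embeddings) pairs from embed_documents, and query would come from embed_query.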
Related Pages
Implements Principle
Requires Environment