
Environment:Ucbepic Docetl Python Runtime

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, LLM_Pipelines
Last Updated: 2026-02-08 01:00 GMT

Overview

Python 3.10+ environment with LiteLLM, Pydantic, pandas, and scikit-learn for running DocETL LLM-powered data processing pipelines.

Description

This environment provides the core Python runtime for DocETL. It is built on Python 3.10 or higher and includes the full dependency stack needed to define, run, and optimize LLM-powered ETL pipelines. The central dependency is LiteLLM (>=1.75.4), which provides a unified interface to 100+ LLM providers. Additional dependencies include Pydantic for schema validation, pandas for data manipulation, scikit-learn for clustering/embedding operations, and diskcache for persistent caching of LLM responses.

Optional dependency groups exist for:

  • Parsing: Document format converters (DOCX, XLSX, PPTX, PDF via PaddleOCR or PyMuPDF)
  • Server: FastAPI backend for the DocWrangler playground
  • Retrieval: LanceDB for retrieval-augmented generation

Usage

Use this environment for any DocETL pipeline execution, whether via CLI (`docetl run`), Python API (`Pipeline(...).run()`), or the DocWrangler playground. It is the mandatory prerequisite for all Implementation pages.
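For the CLI route, a minimal pipeline definition might look like the following. All names, paths, and prompt text are placeholders, and the schema shown is a sketch; consult the DocETL documentation for the full config format:

```yaml
# pipeline.yaml -- minimal map pipeline (illustrative values throughout)
datasets:
  input_docs:
    type: file
    path: documents.json

default_model: gpt-4o-mini

operations:
  - name: summarize
    type: map
    prompt: |
      Summarize the following document: {{ input.text }}
    output:
      schema:
        summary: string

pipeline:
  steps:
    - name: summarize_step
      input: input_docs
      operations:
        - summarize
  output:
    type: file
    path: summaries.json
```

A config like this would then be executed with `docetl run pipeline.yaml`.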

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux, macOS, Windows | Docker uses `python:3.11-slim` (Debian) |
| Hardware | CPU (minimum) | GPU optional for sentence-transformer embeddings |
| Python | >= 3.10 | Docker image uses 3.11 |
| Disk | 2 GB+ | For dependency installation and LLM response cache |

Dependencies

Core Python Packages

  • `litellm` >= 1.75.4
  • `pydantic` >= 2.9.2
  • `pandas` >= 2.3.0
  • `scikit-learn` >= 1.5.2
  • `numpy` >= 1.24.0
  • `scipy` >= 1.10.0
  • `diskcache` >= 5.6.3
  • `typer` >= 0.16.0
  • `rich` >= 13.7.1
  • `tqdm` >= 4.66.4
  • `jsonschema` >= 4.23.0
  • `rapidfuzz` >= 3.10.0
  • `pyrate-limiter` >= 3.7.0
  • `websockets` >= 13.1
  • `boto3` >= 1.37.27
  • `rank-bm25` >= 0.2.2
  • `matplotlib` >= 3.7.0
  • `python-Levenshtein` >= 0.21.0
  • `nltk` >= 3.8.0
  • `modal` >= 0.64.0

Optional: Parsing

  • `python-docx` >= 1.1.2
  • `openpyxl` >= 3.1.5
  • `pydub` >= 0.25.1
  • `python-pptx` >= 1.0.2
  • `azure-ai-documentintelligence` >= 1.0.0b4
  • `paddlepaddle` >= 2.6.2, < 3.2
  • `pymupdf` >= 1.24.10

Optional: Server

  • `fastapi` >= 0.115.4
  • `uvicorn` >= 0.31.0
  • `docling` >= 2.5.2
  • `httpx` >= 0.27.2

Optional: Retrieval

  • `lancedb` >= 0.7.0

Credentials

The following environment variable must be set in `.env`:

  • `OPENAI_API_KEY`: API key for OpenAI models (or other LiteLLM-supported providers). Required for any LLM operation.

Optional credentials depending on usage:

  • `DOCETL_ENCRYPTION_KEY`: Decryption key when API keys are stored encrypted in pipeline config.
  • `DOCETL_HOME_DIR`: Override for cache/data storage directory (default: `~`).
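A minimal `.env` might look like this (all values are placeholders):

```shell
# Required for any LLM operation (placeholder value)
OPENAI_API_KEY=sk-...

# Optional: only if pipeline configs store encrypted API keys
# DOCETL_ENCRYPTION_KEY=...

# Optional: relocate the cache/data directory (default: ~)
# DOCETL_HOME_DIR=/path/to/storage
```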

Quick Install

# Core installation
pip install docetl

# With all optional dependencies
pip install "docetl[parsing,server,retrieval]"

# Or install from source
git clone https://github.com/ucbepic/docetl.git
cd docetl
pip install -e ".[parsing,server,retrieval]"
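After installing, a quick sanity check confirms the core stack resolved without importing anything heavyweight (note that `scikit-learn` imports as `sklearn`):

```python
import importlib.util

# Import names for the core packages listed above (scikit-learn -> sklearn).
core_modules = ["litellm", "pydantic", "pandas", "sklearn", "diskcache"]

# find_spec returns None for any module the current environment cannot resolve.
missing = [m for m in core_modules if importlib.util.find_spec(m) is None]
print("core stack OK" if not missing else f"missing packages: {missing}")
```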

Code Evidence

Python version requirement from `pyproject.toml:6`:

requires-python = ">=3.10"

Core dependencies from `pyproject.toml:13-38`:

dependencies = [
  "tqdm>=4.66.4",
  "rich>=13.7.1",
  "litellm>=1.75.4",
  "pydantic>=2.9.2",
  "pandas>=2.3.0",
  "scikit-learn>=1.5.2",
  # ... (24 total packages)
]

LiteLLM Router import guard from `docetl/config_wrapper.py:99-105`:

try:
    from litellm import Router
except ImportError:
    self.console.log(
        f"[yellow]Warning: LiteLLM Router not available. Fallback {router_type} models will be ignored.[/yellow]"
    )
    return None

GPU/device detection from `docetl/operations/clustering_utils.py:44-51`:

import torch
device = "cpu"
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `requires-python >= 3.10` | Python version too old | Upgrade to Python 3.10+ |
| `ImportError: litellm` | Core dependency missing | `pip install docetl` |
| `Warning: LiteLLM Router not available` | Installed LiteLLM version lacks `Router` | `pip install --upgrade "litellm>=1.75.4"` |
| `ModuleNotFoundError: paddlepaddle` | Parsing extras not installed | `pip install "docetl[parsing]"` |
| `ImportError: matplotlib is required for plotting` | Matplotlib missing for Pareto plots | `pip install "matplotlib>=3.7.0"` |

Compatibility Notes

  • PaddleOCR: Requires `paddlepaddle >= 2.6.2, < 3.2`. Upper version bound is critical.
  • GPU Embeddings: Sentence-transformer embeddings auto-detect CUDA, Apple MPS, or fall back to CPU.
  • BM25 Fallback: If `rank-bm25` is unavailable, sample operations fall back to scikit-learn TF-IDF.
  • Docker: The Dockerfile uses `python:3.11-slim` with all optional extras pre-installed.
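The TF-IDF fallback for `rank-bm25` can be sketched with scikit-learn. The documents and query below are illustrative, and this shows the general idea (lexical ranking by TF-IDF cosine similarity) rather than DocETL's exact fallback code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Lexical retrieval without rank-bm25: rank documents by TF-IDF cosine similarity.
docs = [
    "the cat sat on the mat",
    "dogs chase cats around the yard",
    "pipelines process documents with LLMs",
]
query = "cats and dogs"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)       # fit vocabulary on the corpus
query_vector = vectorizer.transform([query])       # project query into same space

scores = cosine_similarity(query_vector, doc_vectors)[0]
best = int(scores.argmax())  # index of the best-matching document
```

Unlike BM25, plain TF-IDF has no term-saturation or length-normalization parameters, but for short documents the rankings are usually close.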
