

Environment:Unstructured IO Unstructured GitHub Actions

From Leeroopedia
Knowledge Sources
Domains CI_CD, Testing, Quality_Assurance
Last Updated 2026-02-12 09:30 GMT

Overview

GitHub Actions is the execution platform for the Unstructured library's continuous integration (CI) pipeline.

Description

The Unstructured CI workflow (.github/workflows/ci.yml) runs on GitHub-hosted runners. The workflow requires the id-token: write and contents: read permissions. It sets the NLTK_DATA environment variable to an nltk_data directory under the workspace path, where the NLTK data used during testing is stored.

The CI pipeline consists of 13 jobs: dependency caching (setup), license checking, linting, shellcheck, shfmt, unit tests (full, no-extras, and per-extra matrix), ingest connector tests, JSON conversion tests, changelog enforcement, and Docker build/scan.

The workflow triggers on pushes to main, pull requests targeting main, and merge queue events. All jobs must pass for code to be merged.

Usage

Use this environment specification when reproducing CI failures locally or understanding the system-level dependencies required by the CI pipeline.

System Requirements

Category | Requirement | Notes
Runner | ubuntu-latest (GitHub-hosted) | Standard GitHub Actions runner
Python | 3.11, 3.12, 3.13 | Matrix strategy across three versions
System packages | libmagic-dev, poppler-utils, libreoffice, tesseract-ocr, tesseract-ocr-kor | Required for document partitioning tests
Docker | Available on runner | Required for test_dockerfile job
Permissions | id-token: write, contents: read | OIDC token for auth, read-only repo access
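
The three-version Python matrix can be approximated locally by repeating the install step once per interpreter. The sketch below only prints the commands rather than running them. `matrix_commands` is a hypothetical helper, and pairing this page's `uv sync` invocation with uv's `--python` flag is an assumption about how one might select each interpreter, not the workflow's actual step.

```shell
# Python versions from the CI matrix listed above.
PYTHON_VERSIONS="3.11 3.12 3.13"

# Hypothetical helper: print (not run) one uv command per matrix version.
matrix_commands() {
  for version in $PYTHON_VERSIONS; do
    printf 'uv sync --frozen --all-extras --group test --python %s\n' "$version"
  done
}

matrix_commands
```

Printing first makes it easy to review the commands, or to run only the version that failed in CI.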

Dependencies

System Packages

  • libmagic-dev -- C library for MIME type detection
  • poppler-utils -- PDF rendering utilities (pdftotext, pdfimages)
  • libreoffice -- Document conversion for DOC, DOCX, PPT, PPTX, ODT
  • tesseract-ocr -- OCR engine for image-based text extraction
  • tesseract-ocr-kor -- Korean language data for Tesseract
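
After installing these packages, a quick check that the corresponding command-line tools are on PATH can save a failed test run. The following is a minimal sketch and not part of the Unstructured repository: `missing_tools` is a hypothetical helper, and the package-to-binary mapping (`pdftotext` and `pdfimages` for poppler-utils, `soffice` for libreoffice, `tesseract` for tesseract-ocr) assumes a standard Debian/Ubuntu install.

```shell
# Hypothetical helper: print each named command that is NOT found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Assumed binaries provided by poppler-utils, libreoffice, and tesseract-ocr.
missing="$(missing_tools pdftotext pdfimages soffice tesseract)"
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing"
else
  printf 'All expected tools found.\n'
fi
```

libmagic-dev ships a library rather than a command, so it is not covered here. The Korean language data can be checked separately with `tesseract --list-langs`, which should include `kor`.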

Python Packages

  • All packages from pyproject.toml extras, installed via uv sync --frozen
  • Test dependencies from the test dependency group

Credentials

  • NLTK_DATA -- Set to ${{ github.workspace }}/nltk_data for NLTK model storage
  • GitHub Secrets -- API keys for ingest connector tests (AWS, Azure, GCP, Salesforce, etc.)
  • CI -- Set to "true" to enable CI-specific test behaviors
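
For a local run, the two plain environment variables in this list can be mirrored in a shell session. This is a minimal sketch under assumptions: it falls back to the current directory when `GITHUB_WORKSPACE` (the runner's default workspace variable) is unset, and it assumes CI-specific test behavior is wanted locally. Secrets for the ingest connector tests are deliberately not covered.

```shell
# Mirror the CI variables for a local checkout.
# Assumption: the current directory is the repository root when
# GITHUB_WORKSPACE is not set (that is, outside a GitHub runner).
export NLTK_DATA="${GITHUB_WORKSPACE:-$PWD}/nltk_data"
export CI="true"

printf 'NLTK_DATA=%s\nCI=%s\n' "$NLTK_DATA" "$CI"
```

Leave `CI` unset instead if the goal is to run the tests with their default local behavior.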

Quick Install

# Reproduce the CI environment locally (Debian/Ubuntu)
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor

# Install Python dependencies
uv sync --frozen --all-extras --group test
make install-nltk-models

Code Evidence

Workflow trigger and permissions (ci.yml:1-12):

name: CI
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  merge_group:
    branches: [ main ]

permissions:
  id-token: write
  contents: read

NLTK data path (ci.yml):

env:
  NLTK_DATA: ${{ github.workspace }}/nltk_data
