Environment:Unstructured IO Unstructured GitHub Actions
| Knowledge Sources | |
|---|---|
| Domains | CI_CD, Testing, Quality_Assurance |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
The GitHub Actions environment provides the CI/CD execution platform for the Unstructured library's continuous integration pipeline.
Description
The Unstructured CI workflow (.github/workflows/ci.yml) runs on GitHub-hosted runners. The workflow requires id-token: write and contents: read permissions. It sets the NLTK_DATA environment variable to the workspace path for natural language data used by NLTK models during testing.
The CI pipeline consists of 13 jobs: dependency caching (setup), license checking, linting, shellcheck, shfmt, unit tests (full, no-extras, and per-extra matrix), ingest connector tests, JSON conversion tests, changelog enforcement, and Docker build/scan.
The workflow triggers on pushes to main, pull requests targeting main, and merge queue events. All jobs must pass for code to be merged.
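As a rough illustration of the trigger rules just described, the following sketch models them as a lookup table. It is a simplification rather than GitHub's actual matching logic: it ignores glob patterns, path filters, and activity types, and the function name `ci_would_run` is hypothetical.

```python
# Event names and branch filter taken from the workflow's `on:` section.
TRIGGER_BRANCHES = {
    "push": {"main"},
    "pull_request": {"main"},
    "merge_group": {"main"},
}


def ci_would_run(event_name, branch):
    """Simplified model: True if the event/branch pair matches a trigger.

    For pull_request, `branch` is the base (target) branch, not the
    branch that contains the proposed changes.
    """
    return branch in TRIGGER_BRANCHES.get(event_name, set())
```

Under this model, a push to a feature branch does not start the workflow on its own; the run begins once a pull request targeting main exists.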
Usage
Use this environment specification to reproduce CI failures locally or to understand the system-level dependencies that the CI pipeline requires.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Runner | ubuntu-latest (GitHub-hosted) | Standard GitHub Actions runner |
| Python | 3.11, 3.12, 3.13 | Matrix strategy across three versions |
| System packages | libmagic-dev, poppler-utils, libreoffice, tesseract-ocr, tesseract-ocr-kor | Required for document partitioning tests |
| Docker | Available on runner | Required for test_dockerfile job |
| Permissions | id-token: write, contents: read | OIDC token for auth, read-only repo access |
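When reproducing a failure locally, it can help to confirm first that the local interpreter is one of the versions in the table. A minimal sketch, assuming only the three versions listed above; the helper name `matches_ci_python` is hypothetical:

```python
import sys

# Versions listed in the System Requirements table.
CI_PYTHON_VERSIONS = {(3, 11), (3, 12), (3, 13)}


def matches_ci_python(version_info=None):
    """Return True if the (major, minor) pair is one of the CI matrix versions."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in CI_PYTHON_VERSIONS


if __name__ == "__main__":
    current = "%d.%d" % tuple(sys.version_info[:2])
    status = "matches" if matches_ci_python() else "is not in"
    print(f"Python {current} {status} the CI matrix")
```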
Dependencies
System Packages
- libmagic-dev -- C library for MIME type detection
- poppler-utils -- PDF rendering utilities (pdftotext, pdfimages)
- libreoffice -- Document conversion for DOC, DOCX, PPT, PPTX, ODT
- tesseract-ocr -- OCR engine for image-based text extraction
- tesseract-ocr-kor -- Korean language data for Tesseract
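A quick local check for these packages might look like the sketch below. The mapping from package to executable name is an assumption based on what the packages usually install on Debian/Ubuntu, not something taken from ci.yml. The libmagic probe only looks for the shared library, so it cannot tell whether the development headers are present, and the Korean Tesseract language data is not checked at all.

```python
import ctypes.util
import shutil

# Executables assumed to be provided by each apt package (an assumption,
# not taken from ci.yml). libmagic-dev ships a library rather than a
# command, so it is probed separately below.
PACKAGE_EXECUTABLES = {
    "poppler-utils": ["pdftotext", "pdfimages"],
    "libreoffice": ["soffice"],
    "tesseract-ocr": ["tesseract"],
}


def find_missing(which=shutil.which, find_library=ctypes.util.find_library):
    """Return a sorted list of packages whose files could not be located."""
    missing = []
    for package, executables in PACKAGE_EXECUTABLES.items():
        # A package counts as missing if any of its executables is absent.
        if any(which(exe) is None for exe in executables):
            missing.append(package)
    # Approximate check: finds the libmagic shared library, not the headers.
    if find_library("magic") is None:
        missing.append("libmagic-dev")
    return sorted(missing)


if __name__ == "__main__":
    print("Missing packages:", find_missing() or "none")
```

The `which` and `find_library` parameters exist only so the lookup can be replaced in tests; calling `find_missing()` with no arguments inspects the real system.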
Python Packages
- All packages from pyproject.toml extras, installed via uv sync --frozen
- Test dependencies from the test dependency group
Credentials
- NLTK_DATA -- Set to ${{ github.workspace }}/nltk_data for NLTK model storage
- GitHub Secrets -- API keys for ingest connector tests (AWS, Azure, GCP, Salesforce, etc.)
- CI -- Set to "true" to enable CI-specific test behaviors
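To approximate the two non-secret variables outside GitHub Actions, one option is to build the environment explicitly and pass it to whatever command runs the tests. This is a sketch under stated assumptions: the function name `ci_like_env` is hypothetical, and the local repository checkout stands in for the GitHub workspace. The secrets for the ingest connector tests are deliberately left out.

```python
import os
from pathlib import Path


def ci_like_env(workspace, base_env=None):
    """Return a copy of the environment with the two CI variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Mirrors the workflow's NLTK_DATA value: <workspace>/nltk_data.
    env["NLTK_DATA"] = str(Path(workspace) / "nltk_data")
    # CI is set to "true" to enable CI-specific test behaviors.
    env["CI"] = "true"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the test command from the repository root.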
Quick Install
# Reproduce the CI environment locally (Debian/Ubuntu)
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
# Install Python dependencies (all extras plus the test dependency group)
uv sync --frozen --all-extras --group test
# Install the NLTK models used during testing
make install-nltk-models
Code Evidence
Workflow trigger and permissions (ci.yml:1-12):
name: CI
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  merge_group:
    branches: [ main ]
permissions:
  id-token: write
  contents: read
NLTK data path (ci.yml):
env:
  NLTK_DATA: ${{ github.workspace }}/nltk_data