Implementation:Unstructured IO Unstructured CI Workflow
| Knowledge Sources | |
|---|---|
| Domains | CI_CD, Testing, Quality_Assurance |
| Last Updated | 2026-02-12 09:30 GMT |
Overview
A concrete GitHub Actions workflow that provides continuous integration validation for the Unstructured library.
Description
The CI Workflow is the primary GitHub Actions workflow that gates all code changes to the Unstructured repository. It defines a multi-stage pipeline triggered on pushes to main, pull requests targeting main, and merge queue events. The workflow orchestrates dependency caching, license checking, linting, shell script validation and formatting checks, unit testing across Python 3.11/3.12/3.13, per-extra dependency isolation testing (csv, docx, odt, markdown, pypandoc, pdf-image, pptx, xlsx), end-to-end ingest connector tests, JSON-to-HTML and JSON-to-Markdown conversion tests, changelog enforcement, and Docker image build and scan.
Usage
This workflow executes automatically on every push to main, pull request targeting main, and merge queue event. It is the authoritative quality gate — all jobs must pass before code can be merged. Contributors do not invoke it directly; it is triggered by Git operations against the repository.
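The trigger rules described above can be sketched as a small shell predicate. This is only an illustration of the branch filters, not code from the repository; the function name `should_run_ci` is hypothetical, and for `pull_request` the second argument is assumed to be the base (target) branch.

```shell
# Hypothetical helper mirroring the workflow's trigger filters.
# $1 = event name, $2 = target branch (base branch for pull_request).
# Returns 0 when the event/branch pair would start the CI workflow.
should_run_ci() {
  case "$1" in
    push|pull_request|merge_group) [ "$2" = "main" ] ;;
    *) return 1 ;;  # any other event is not listed as a trigger
  esac
}

should_run_ci pull_request main && echo "pull_request -> main: CI runs"
should_run_ci push feature/my-change || echo "push -> feature/my-change: CI does not run"
```

A push to a feature branch alone does not start the pipeline; the run begins once a pull request targeting main exists for that branch.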
Code Reference
Source Location
- Repository: Unstructured_IO_Unstructured
- File: .github/workflows/ci.yml
- Lines: 1-325
Signature
name: CI
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  merge_group:
    branches: [ main ]
permissions:
  id-token: write
  contents: read
env:
  NLTK_DATA: ${{ github.workspace }}/nltk_data
jobs:
  setup: # Cache dependencies across Python 3.11, 3.12, 3.13
  check-licenses: # Validate dependency licenses
  lint: # Run code quality checks (make check)
  shellcheck: # Validate shell scripts
  shfmt: # Check shell script formatting
  test_unit: # Full test suite with system deps
  test_unit_no_extras: # Minimal dependency tests
  test_unit_dependency_extras: # Per-extra isolation matrix
  test_ingest_src: # End-to-end ingest connector tests
  test_json_to_html: # HTML fixture validation
  test_json_to_markdown: # Markdown fixture validation
  changelog: # CHANGELOG.md enforcement
  test_dockerfile: # Docker build + scan
Import
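# Not importable; GitHub Actions triggers the workflow automatically.
# The triggers listed in the signature do not include workflow_dispatch, so
# `gh workflow run CI` is not expected to work for this workflow.
# To repeat an existing run instead: gh run rerun <run-id>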
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| push event | Git push | Yes (one of) | Push to main branch triggers full pipeline |
| pull_request event | Git PR | Yes (one of) | PR targeting main triggers full pipeline |
| merge_group event | Merge queue | Yes (one of) | Merge queue entry triggers full pipeline |
| secrets | GitHub Secrets | Yes | API keys for ingest connectors (AWS, Azure, GCP, Salesforce, etc.) |
Outputs
| Name | Type | Description |
|---|---|---|
| Job statuses | Pass/Fail | Each job reports success or failure |
| Coverage report | Text | Generated by `make check-coverage` in test_unit |
| Docker scan | Table | Anchore scan results for critical vulnerabilities |
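The "all jobs must pass" rule implied by the job statuses can be sketched as a simple aggregation. This is a minimal sketch under stated assumptions: the function name `all_jobs_passed` is hypothetical, the status strings follow GitHub's job conclusion values (`success`, `failure`, `cancelled`, `skipped`), and anything other than `success` is treated as not passing, which may be stricter than the repository's actual branch protection settings.

```shell
# Hypothetical gate: succeeds only if at least one status is given
# and every status equals "success".
all_jobs_passed() {
  [ "$#" -gt 0 ] || return 1
  for conclusion in "$@"; do
    [ "$conclusion" = "success" ] || return 1
  done
  return 0
}

if all_jobs_passed success success failure; then
  echo "gate open"
else
  echo "gate closed"
fi
```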
Usage Examples
Triggering via Pull Request
# The CI workflow runs automatically when you open a PR
git checkout -b feature/my-change
git add .
git commit -m "Add new feature"
git push origin feature/my-change
# Open PR on GitHub — CI workflow triggers automatically
Running a Specific Job Locally (Approximation)
# Reproduce the unit test job locally
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
make test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
make check-coverage
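Before running the commands above, it can help to confirm that the system packages actually put their executables on the PATH. The sketch below is an assumption-laden convenience, not part of the workflow: `missing_tools` is a hypothetical helper, and the binary names (`tesseract` from tesseract-ocr, `pdftoppm` from poppler-utils, `soffice` from libreoffice) are the usual ones on Debian/Ubuntu but may differ elsewhere. libmagic-dev ships a library rather than a command, so it is not checked here.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Prints nothing when all three tools are installed.
missing_tools tesseract pdftoppm soffice
```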
Running Per-Extra Tests Locally
# Test a specific extra (e.g., pdf-image)
uv sync --frozen --extra pdf --extra image --extra paddleocr --group test
make install-nltk-models
make test-extra-pdf-image CI=true
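To preview the per-extra runs without executing them, a dry-run loop can print one command per extra. This is a sketch only: the extras list comes from the Description, `extra_test_target` is a hypothetical helper, and the `test-extra-<name>` target pattern is inferred from the single `test-extra-pdf-image` example, so the other target names should be checked against the Makefile. The `uv sync` step is left out because the extras to install differ per target.

```shell
# Extras named in the Description section.
extras="csv docx odt markdown pypandoc pdf-image pptx xlsx"

# Hypothetical helper: derive the make target name for an extra,
# assuming every extra follows the test-extra-<name> pattern.
extra_test_target() {
  printf 'test-extra-%s\n' "$1"
}

# Dry run: print the commands instead of executing them.
for extra in $extras; do
  echo "make $(extra_test_target "$extra") CI=true"
done
```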