Implementation:Unstructured IO Unstructured CI Workflow

From Leeroopedia
Revision as of 11:54, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Unstructured_IO_Unstructured_CI_Workflow.md)
Knowledge Sources
Domains: CI_CD, Testing, Quality_Assurance
Last Updated: 2026-02-12 09:30 GMT

Overview

A concrete GitHub Actions workflow that performs continuous integration validation of the Unstructured library.

Description

The CI Workflow is the primary GitHub Actions workflow that gates all code changes to the Unstructured repository. It defines a multi-stage pipeline triggered on pushes to main, pull requests, and merge queue events. The workflow orchestrates dependency caching, license checking, linting, shell script validation, unit testing across Python 3.11/3.12/3.13, per-extra dependency isolation testing (csv, docx, odt, markdown, pypandoc, pdf-image, pptx, xlsx), end-to-end ingest connector tests, JSON-to-HTML/Markdown conversion tests, changelog enforcement, and Docker image build/scan.

Usage

This workflow executes automatically on every push to main, pull request targeting main, and merge queue event. It is the authoritative quality gate — all jobs must pass before code can be merged. Contributors do not invoke it directly; it is triggered by Git operations against the repository.
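The trigger rules can be read as a small predicate. The sketch below is illustrative only: `ci_triggers` is a hypothetical helper, not part of the repository, and it simply mirrors the `on:` block shown in the Signature section. For `pull_request`, the branch argument stands for the PR's base branch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mirrors the workflow's `on:` block.
# Usage: ci_triggers <event> <branch>
# Returns 0 when the event/branch pair would start the CI workflow.
ci_triggers() {
  local event="$1" branch="$2"
  case "$event" in
    push | pull_request | merge_group)
      # All three listed events are filtered to the main branch.
      [ "$branch" = "main" ]
      ;;
    *)
      # Any other event (e.g. workflow_dispatch) is not listed.
      return 1
      ;;
  esac
}

ci_triggers pull_request main && echo "CI runs"
ci_triggers push feature/my-change || echo "CI does not run"
```

Pushing a feature branch on its own does not start the workflow; the run begins once a pull request targeting main exists for that branch.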

Code Reference

Source Location

Signature

name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  merge_group:
    branches: [ main ]

permissions:
  id-token: write
  contents: read

env:
  NLTK_DATA: ${{ github.workspace }}/nltk_data

jobs:
  setup:           # Cache dependencies across Python 3.11, 3.12, 3.13
  check-licenses:  # Validate dependency licenses
  lint:            # Run code quality checks (make check)
  shellcheck:      # Validate shell scripts
  shfmt:           # Check shell script formatting
  test_unit:       # Full test suite with system deps
  test_unit_no_extras:        # Minimal dependency tests
  test_unit_dependency_extras: # Per-extra isolation matrix
  test_ingest_src: # End-to-end ingest connector tests
  test_json_to_html:     # HTML fixture validation
  test_json_to_markdown: # Markdown fixture validation
  changelog:       # CHANGELOG.md enforcement
  test_dockerfile: # Docker build + scan

Import

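# Not importable: triggered automatically by GitHub Actions.
# No manual trigger: the `on:` block above declares no workflow_dispatch event,
# so `gh workflow run CI` would be rejected unless that trigger is added.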

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| push event | Git push | Yes (one of) | Push to main branch triggers full pipeline |
| pull_request event | Git PR | Yes (one of) | PR targeting main triggers full pipeline |
| merge_group event | Merge queue | Yes (one of) | Merge queue entry triggers full pipeline |
| secrets | GitHub Secrets | Yes | API keys for ingest connectors (AWS, Azure, GCP, Salesforce, etc.) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Job statuses | Pass/Fail | Each job reports success or failure |
| Coverage report | Text | Generated by `make check-coverage` in test_unit |
| Docker scan | Table | Anchore scan results for critical vulnerabilities |
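Because every job must pass before a merge, the job statuses combine into a single all-or-nothing gate. The following is a minimal sketch of that rule, not the mechanism GitHub uses: `all_jobs_passed` is a hypothetical helper, and the `<job> <status>` input format and the sample statuses are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<job> <status>" lines on stdin and
# succeeds only when every job reported "pass".
all_jobs_passed() {
  local job status failed=0
  while read -r job status; do
    [ -z "$job" ] && continue   # skip blank lines
    if [ "$status" != "pass" ]; then
      # Report each blocking job on stderr.
      echo "blocking: $job ($status)" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example input; the statuses are made up for illustration.
printf '%s\n' "lint pass" "test_unit pass" "changelog fail" \
  | all_jobs_passed || echo "merge blocked"
```

A single failing job, such as the changelog check in this sample, is enough to block the merge regardless of how the other jobs finished.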

Usage Examples

Triggering via Pull Request

# The CI workflow runs automatically when you open a PR
git checkout -b feature/my-change
git add .
git commit -m "Add new feature"
git push origin feature/my-change
# Open PR on GitHub — CI workflow triggers automatically

Running a Specific Job Locally (Approximation)

# Reproduce the unit test job locally
sudo apt-get install -y libmagic-dev poppler-utils libreoffice
sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
make test CI=true UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
make check-coverage
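Before running the commands above, it can help to confirm that the system packages actually put their tools on `PATH`. This is a hedged sketch: `missing_tools` is a hypothetical helper, and the command names probed (`tesseract`, `pdftoppm`, `soffice`) are assumptions about what the listed packages install.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints each named command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed command names: tesseract (tesseract-ocr), pdftoppm
# (poppler-utils), soffice (libreoffice). libmagic-dev is a library,
# so there is no command to probe for it here.
missing="$(missing_tools tesseract pdftoppm soffice)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "Install before running make test:" $missing
fi
```

The check prints nothing when all three commands are found, so it can sit at the top of a local setup script without adding noise.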

Running Per-Extra Tests Locally

# Test a specific extra (e.g., pdf-image)
uv sync --frozen --extra pdf --extra image --extra paddleocr --group test
make install-nltk-models
make test-extra-pdf-image CI=true
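To walk the whole per-extra matrix rather than a single entry, the extras named in the Description can be looped over. The sketch below only prints the commands (a dry run). `extra_targets` is a hypothetical helper, and the `test-extra-<name>` pattern is an assumption generalized from `test-extra-pdf-image`, the only target this page confirms.

```shell
#!/usr/bin/env bash
# Extras named in the workflow description.
EXTRAS="csv docx odt markdown pypandoc pdf-image pptx xlsx"

# Hypothetical helper: prints one make target per extra.
# ASSUMPTION: every extra follows the test-extra-<name> pattern;
# only test-extra-pdf-image is confirmed by the example above.
extra_targets() {
  local extra
  for extra in $EXTRAS; do
    echo "test-extra-$extra"
  done
}

# Dry run: show the commands instead of executing them.
extra_targets | while read -r target; do
  echo "make $target CI=true"
done
```

Each extra also needs its own `uv sync` step with the matching `--extra` flags, as in the pdf-image example; that mapping is not listed on this page, so it is left out of the sketch.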
