Implementation:Turboderp org Exllamav2 Build Wheels Release

From Leeroopedia
Knowledge Sources
Domains: CI_CD, Build_System
Last Updated: 2026-02-15 00:00 GMT

Overview

A comprehensive GitHub Actions CI/CD workflow that builds ExLlamaV2 Python wheel packages across 88 matrix configurations spanning multiple operating systems, Python versions, CUDA versions, ROCm versions, and PyTorch versions, with optional upload to GitHub Releases.

Description

This workflow, named "Build Wheels & Release", is the primary release pipeline for the ExLlamaV2 project. It uses a manually triggered workflow_dispatch event and defines an expansive build matrix to produce pre-compiled wheel packages for a wide range of platform and toolkit combinations.

Platform coverage:

  • Ubuntu 22.04 (Linux) -- 49 wheel configurations: 40 CUDA builds, 8 ROCm builds, and 1 extra HF Spaces wheel
  • Windows Server 2022 -- 38 configurations covering CUDA builds only (no ROCm support on Windows)
  • 1 source distribution (sdist) built on Ubuntu with no GPU toolkit
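
As a quick cross-check of the totals, the following sketch tallies the group sizes stated in the comments of the workflow matrix quoted under Signature. The dictionary labels are illustrative names chosen here, not identifiers from the workflow.

```python
# Entry counts per group, taken from the comments in the matrix
# definition quoted under "Signature". Labels are illustrative only.
MATRIX_GROUPS = {
    "ubuntu-22.04 CUDA": 40,
    "windows-2022 CUDA": 38,
    "ubuntu-22.04 ROCm": 8,
    "ubuntu-22.04 HF Spaces wheel": 1,
    "ubuntu-22.04 sdist": 1,
}


def total_for(prefix: str) -> int:
    """Sum the entry counts of every group whose label starts with `prefix`."""
    return sum(n for label, n in MATRIX_GROUPS.items() if label.startswith(prefix))


ubuntu_total = total_for("ubuntu-22.04")    # wheels plus the sdist job
windows_total = total_for("windows-2022")
grand_total = sum(MATRIX_GROUPS.values())

print(ubuntu_total, windows_total, grand_total)  # prints: 50 38 88
```

Counted this way, Ubuntu runs 50 jobs once the sdist job is included (49 wheel builds plus the sdist), and the grand total matches the 88 configurations stated in the Overview.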

Python version coverage:

  • Python 3.10, 3.11, 3.12, and 3.13 -- though not every Python version is available for every PyTorch/CUDA combination (e.g., Python 3.13 support begins at torch 2.5.0)

CUDA version coverage:

  • CUDA 11.8.0 -- paired with older PyTorch versions (2.3.1 through 2.6.0)
  • CUDA 12.1.0 -- paired with PyTorch 2.3.1 through 2.5.0
  • CUDA 12.4.0 -- paired with PyTorch 2.6.0
  • CUDA 12.8.1 -- paired with the latest PyTorch versions (2.7.0, 2.8.0, 2.9.0)

ROCm version coverage (Ubuntu only):

  • ROCm 5.6 with PyTorch 2.2.2 (Python 3.10-3.11)
  • ROCm 6.0 with PyTorch 2.3.1 (Python 3.10-3.12)
  • ROCm 6.1 with PyTorch 2.4.0 (Python 3.10-3.12)

PyTorch version coverage:

  • PyTorch 2.2.2 through PyTorch 2.9.0 (8 distinct versions total)

CUDA architecture targets:

  • Older CUDA versions target: 6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX (Pascal through Hopper)
  • CUDA 12.8.1 targets: 6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX (adding Blackwell: 10.0 for data-center GPUs and 12.0 for consumer GPUs)
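
To make the architecture lists easier to read, here is a minimal sketch that splits a TORCH_CUDA_ARCH_LIST-style string and maps each compute capability to its NVIDIA architecture family. The name table reflects general NVIDIA naming and is not taken from the workflow; the function names are hypothetical.

```python
# Compute capability -> architecture family (general NVIDIA naming,
# not something defined by the workflow itself).
ARCH_NAMES = {
    "6.0": "Pascal", "6.1": "Pascal",
    "7.0": "Volta", "7.5": "Turing",
    "8.0": "Ampere", "8.6": "Ampere",
    "8.9": "Ada Lovelace", "9.0": "Hopper",
    "10.0": "Blackwell", "12.0": "Blackwell",
}


def parse_arch_list(arch_list: str) -> list[tuple[str, bool]]:
    """Split an arch list into (capability, emits_ptx) pairs."""
    entries = []
    for token in arch_list.split():
        has_ptx = token.endswith("+PTX")
        entries.append((token.removesuffix("+PTX"), has_ptx))
    return entries


def families(arch_list: str) -> list[str]:
    """Return the distinct architecture families in first-seen order."""
    seen: list[str] = []
    for capability, _ in parse_arch_list(arch_list):
        name = ARCH_NAMES.get(capability, "unknown")
        if name not in seen:
            seen.append(name)
    return seen


older = "6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX"
print(families(older))
```

Only the last entry carries the +PTX suffix, so forward-compatible PTX is emitted for the newest listed capability only.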

The build process follows a multi-stage pipeline for each matrix entry:

  1. Free Disk Space -- on Linux runners, reclaim disk by removing Android SDK, .NET, Haskell, and swap (uses jlumbroso/free-disk-space@v1.3.1)
  2. Checkout -- clone the repository via actions/checkout@v4
  3. Version Extraction -- parse exllamav2/version.py using PowerShell regex to extract the __version__ string
  4. VS Build Tools -- on Windows, install VS2022 BuildTools 17.9.7 via Chocolatey for C++ compilation compatibility
  5. Python Setup -- install uv via astral-sh/setup-uv@v5 and configure the target Python version
  6. Toolkit Installation -- conditionally install either ROCm SDK (via apt from repo.radeon.com), Windows CUDA (manual download and extraction of NVIDIA redistributable archives), or Linux CUDA (via Jimver/cuda-toolkit@v0.2.23)
  7. Dependency Installation -- install PyTorch from the appropriate index URL, plus build, setuptools==69.5.1, wheel, packaging, ninja, safetensors, tokenizers, and numpy
  8. Wheel Build -- run python -m build -n --wheel with build tags encoding the CUDA/ROCm version and PyTorch version (e.g., +cu128-torch2.9.0 or +rocm6.1-torch2.4.0). On Windows, the VS Developer Shell is spawned first and DISTUTILS_USE_SDK=1 is set.
  9. Release Upload -- if the release input is set to '1' and the version was parsed successfully, upload all .whl files to a GitHub Release tagged with the version (uses svenstaro/upload-release-action@2.6.1)

The workflow uses fail-fast: false so that a failure in one matrix entry does not cancel other builds. The default shell is PowerShell (pwsh) for cross-platform scripting consistency, with the ROCm build step explicitly using bash.
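
The conditional toolkit installation in stage 6 can be pictured as a simple dispatch on the matrix entry. This is a sketch under stated assumptions: the returned labels are hypothetical names invented here, and the real workflow expresses the same choice through per-step `if:` conditions rather than Python.

```python
def toolkit_step(entry: dict) -> str:
    """Pick which toolkit-installation path a matrix entry would take.

    The returned labels are hypothetical; they only name the three
    installation paths described in the stage list above.
    """
    if entry.get("rocm"):
        # ROCm SDK installed via apt from repo.radeon.com (Ubuntu only)
        return "rocm-apt"
    if entry.get("cuda"):
        if entry["os"].startswith("windows"):
            # NVIDIA redistributable archives downloaded and extracted manually
            return "cuda-windows-redist"
        # Jimver/cuda-toolkit action
        return "cuda-linux-action"
    # No GPU toolkit, e.g. the sdist entry
    return "none"


rocm_entry = {"os": "ubuntu-22.04", "pyver": "3.10", "rocm": "5.6", "torch": "2.2.2"}
sdist_entry = {"artname": "sdist", "os": "ubuntu-22.04", "pyver": "3.11", "torch": "2.3.1"}
print(toolkit_step(rocm_entry), toolkit_step(sdist_entry))  # prints: rocm-apt none
```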

Usage

Use this workflow when preparing a full release of ExLlamaV2 that must support all historically maintained CUDA versions, ROCm versions, and PyTorch versions. This is the workflow to run when publishing a new version tag to GitHub Releases with maximum platform compatibility. For quick iteration on only the latest PyTorch version, prefer the companion Build Wheels Release Torch Latest workflow instead.

Code Reference

Source Location

Signature

name: Build Wheels & Release

on:
  workflow_dispatch:
    inputs:
      release:
        description: 'Release? 1 = yes, 0 = no'
        default: '0'
        required: true
        type: string

permissions:
  contents: write

jobs:
  build_wheels:
    name: ${{ matrix.os }} P${{ matrix.pyver }} C${{ matrix.cuda }} R${{ matrix.rocm }} T${{ matrix.torch }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        include:
          # 88 configurations total:
          # Ubuntu 22.04 CUDA: Python 3.10-3.13 x CUDA 11.8/12.1/12.4/12.8 x Torch 2.3.1-2.9.0
          - { os: ubuntu-22.04, pyver: '3.10', cuda: '11.8.0', torch: '2.3.1', cudaarch: '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' }
          # ... (40 Ubuntu CUDA entries)
          # Windows 2022 CUDA: Python 3.10-3.13 x CUDA 11.8/12.1/12.4/12.8 x Torch 2.3.1-2.9.0
          - { os: windows-2022, pyver: '3.10', cuda: '11.8.0', torch: '2.3.1', cudaarch: '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' }
          # ... (38 Windows CUDA entries)
          # ROCm builds (Ubuntu only): ROCm 5.6/6.0/6.1 x Torch 2.2.2-2.4.0
          - { os: ubuntu-22.04, pyver: '3.10', rocm: '5.6', torch: '2.2.2' }
          # ... (8 ROCm entries)
          # sdist and extra HF Spaces wheel
          - { artname: 'sdist', os: ubuntu-22.04, pyver: '3.11', torch: '2.3.1' }
          - { os: ubuntu-22.04, pyver: '3.10', cuda: '12.1.0', torch: '2.2.2' }
      fail-fast: false

Import

# Triggered via GitHub Actions workflow_dispatch
# No import needed - this is a CI/CD pipeline

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| release | string | Yes | Whether to upload wheels to the GitHub Release: 1 = yes, 0 = no (default: '0') |

Outputs

| Name | Type | Description |
|------|------|-------------|
| wheels | .whl files | Built Python wheel packages with build tags (e.g., exllamav2-0.2.7+cu128.torch2.9.0-cp313-cp313-linux_x86_64.whl; PEP 440 normalization turns the hyphen in the build tag into a dot in the filename) |
| sdist | .tar.gz file | Source distribution built with EXLLAMA_NOCOMPILE=1 (no CUDA compilation) |
| release | GitHub Release | Uploaded wheels tagged as v{version} (only when release='1' and the version was parsed successfully) |
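
A wheel filename separates its fields with hyphens, so a local version label such as +cu128-torch2.9.0 cannot appear in it unchanged. The sketch below shows the PEP 440 rule that rewrites '-' and '_' in the local label to '.', assuming the build backend applies that normalization; it is a partial illustration (for example, it does not lowercase the label) and the function names are hypothetical.

```python
import re


def normalize_local(version: str) -> str:
    """Rewrite '-' and '_' in the local label (after '+') to '.'.

    Partial sketch of PEP 440 local-version normalization.
    """
    public, sep, local = version.partition("+")
    if not sep:
        return version  # no local label, nothing to normalize
    return public + "+" + re.sub(r"[-_]", ".", local)


def wheel_filename(name: str, version: str, python_tag: str,
                   abi_tag: str, platform_tag: str) -> str:
    """Assemble name-version-python-abi-platform.whl (no build number field)."""
    return f"{name}-{normalize_local(version)}-{python_tag}-{abi_tag}-{platform_tag}.whl"


print(wheel_filename("exllamav2", "0.2.7+cu128-torch2.9.0",
                     "cp313", "cp313", "linux_x86_64"))
```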

Matrix Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| artname | string | Artifact type: 'wheel' or 'sdist' |
| os | string | Runner OS: ubuntu-22.04 or windows-2022 |
| pyver | string | Python version: '3.10', '3.11', '3.12', or '3.13' |
| cuda | string | CUDA toolkit version: '11.8.0', '12.1.0', '12.4.0', or '12.8.1' (empty for ROCm and sdist entries) |
| rocm | string | ROCm SDK version: '5.6', '6.0', or '6.1' (empty for CUDA and sdist entries) |
| torch | string | PyTorch version: '2.2.2' through '2.9.0' |
| cudaarch | string | Space-separated CUDA architecture targets (e.g., '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX') |

Build Steps Detail

Step 1: Free Disk Space

On Linux runners, the jlumbroso/free-disk-space@v1.3.1 action removes the Android SDK, .NET SDK, Haskell toolchain, and swap storage to reclaim disk for the large CUDA compilation. The large-packages option is set to false to avoid removing system packages that may be needed.

Step 2: Version Extraction

A PowerShell script reads exllamav2/version.py and extracts the version using the regex pattern __version__ = "(\d+\.(?:\d+\.?(?:dev\d+)?)*)". The extracted version is stored as the PACKAGE_VERSION output and used later for the release tag.
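
For readers who want to try the pattern outside PowerShell, here is a minimal Python equivalent of the extraction. The regex is the one quoted above; the function name and the sample inputs are illustrative, and the workflow itself does this step in PowerShell.

```python
import re

# Same pattern as quoted in the text, compiled with Python's re module.
VERSION_RE = re.compile(r'__version__ = "(\d+\.(?:\d+\.?(?:dev\d+)?)*)"')


def extract_version(source: str):
    """Return the version string found in `source`, or None if absent."""
    match = VERSION_RE.search(source)
    return match.group(1) if match else None


print(extract_version('__version__ = "0.2.7"'))        # prints: 0.2.7
print(extract_version('__version__ = "0.2.7.dev1"'))   # prints: 0.2.7.dev1
print(extract_version("__version__ = '0.2.7'"))        # prints: None
```

Note that the pattern only accepts double quotes and a single space on each side of the equals sign, so a differently formatted version.py would yield no match and, per the release condition described on this page, no upload.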

Step 3: Platform Toolchain Setup

  • Windows: VS2022 BuildTools 17.9.7 is installed via Chocolatey, pinned to a specific version for wider binary compatibility
  • Windows CUDA: Individual CUDA redistributable archives (cudart, nvcc, nvrtc, cublas, nvtx, profiler_api, cccl, cusparse, cusolver, curand, cufft) are downloaded and extracted manually
  • Linux CUDA: The Jimver/cuda-toolkit@v0.2.23 action installs CUDA via the network method
  • Linux ROCm: The ROCm SDK is installed from repo.radeon.com apt repository with GPG key verification

Step 4: Wheel Build

The wheel is built via python -m build -n --wheel with a build tag appended via egg_info. The tag format is:

  • CUDA: +cu{version}-torch{version} (e.g., +cu128-torch2.9.0)
  • ROCm: +rocm{version}-torch{version} (e.g., +rocm6.1-torch2.4.0)

The TORCH_CUDA_ARCH_LIST environment variable controls which GPU architectures are compiled.
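
The tag format can be sketched as a small function. This assumes the CUDA part is formed from the major and minor version digits (so '12.8.1' becomes cu128), which matches the examples above; the function name and the empty-tag fallback for entries without a GPU toolkit are assumptions made here.

```python
def build_tag(torch: str, cuda: str = "", rocm: str = "") -> str:
    """Compose the local build tag for a matrix entry.

    Assumption: entries with neither CUDA nor ROCm get an empty tag.
    """
    if rocm:
        return f"+rocm{rocm}-torch{torch}"
    if cuda:
        # '12.8.1' -> major '12', minor '8' -> 'cu128'
        major, minor = cuda.split(".")[:2]
        return f"+cu{major}{minor}-torch{torch}"
    return ""


print(build_tag("2.9.0", cuda="12.8.1"))   # prints: +cu128-torch2.9.0
print(build_tag("2.4.0", rocm="6.1"))      # prints: +rocm6.1-torch2.4.0
```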

Step 5: Release Upload

When release='1', the svenstaro/upload-release-action@2.6.1 uploads all .whl files from ./dist/ to a GitHub Release. The release is tagged as v{PACKAGE_VERSION} with overwrite: true and file_glob: true.
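
The upload gate combines two conditions: the release input must be the string '1', and a version must have been parsed. A minimal sketch of that logic follows, with hypothetical function names; in the workflow this is an `if:` expression on the upload step rather than Python.

```python
def should_upload(release_input: str, package_version) -> bool:
    """True only when release is the string '1' and a version was parsed."""
    return release_input == "1" and bool(package_version)


def release_tag(package_version: str) -> str:
    """Release tag format described above: v{PACKAGE_VERSION}."""
    return f"v{package_version}"


print(should_upload("1", "0.2.7"), release_tag("0.2.7"))  # prints: True v0.2.7
print(should_upload("1", None))                            # prints: False
```

Because the input is typed as a string, the comparison is against '1' rather than the integer 1.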

Usage Examples

Trigger Build Without Release

# Trigger a build-only run (no upload to GitHub Releases)
gh workflow run "Build Wheels & Release" --ref main -f release=0

Trigger Build With Release

# Trigger a full build and upload wheels to GitHub Releases
gh workflow run "Build Wheels & Release" --ref main -f release=1

Monitor Build Progress

# List recent workflow runs
gh run list --workflow="Build Wheels & Release" --limit 5

# Watch a specific run
gh run watch <run-id>

Dependencies

GitHub Actions Used

| Action | Version | Purpose |
|--------|---------|---------|
| jlumbroso/free-disk-space | v1.3.1 | Reclaim disk space on Linux runners |
| actions/checkout | v4 | Clone the repository |
| astral-sh/setup-uv | v5 | Install uv package manager and configure Python |
| Jimver/cuda-toolkit | v0.2.23 | Install CUDA toolkit on Linux |
| svenstaro/upload-release-action | 2.6.1 | Upload wheel artifacts to GitHub Releases |

Python Build Dependencies

  • torch (version from matrix)
  • build
  • setuptools==69.5.1 (pinned)
  • wheel
  • packaging
  • ninja
  • safetensors
  • tokenizers
  • numpy
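
As an illustration of how the dependency set might be assembled per matrix entry, the sketch below builds install command lines. Several things here are assumptions rather than facts from the workflow: plain `pip` is shown although the workflow installs uv, the index URLs follow PyTorch's public wheel index layout (download.pytorch.org/whl/<variant>), and the CPU index for entries without a GPU toolkit is a guess.

```python
# Build dependencies as listed above; only setuptools is pinned.
BUILD_DEPS = ["build", "setuptools==69.5.1", "wheel", "packaging",
              "ninja", "safetensors", "tokenizers", "numpy"]


def torch_index_url(cuda: str = "", rocm: str = "") -> str:
    """Pick a PyTorch wheel index for the entry's toolkit (assumed layout)."""
    base = "https://download.pytorch.org/whl"
    if rocm:
        return f"{base}/rocm{rocm}"
    if cuda:
        major, minor = cuda.split(".")[:2]
        return f"{base}/cu{major}{minor}"
    return f"{base}/cpu"  # assumption for entries with no GPU toolkit


def install_commands(torch: str, cuda: str = "", rocm: str = "") -> list[list[str]]:
    """Return two argument lists: one for torch, one for the build deps."""
    return [
        ["pip", "install", f"torch=={torch}",
         "--index-url", torch_index_url(cuda, rocm)],
        ["pip", "install", *BUILD_DEPS],
    ]


for command in install_commands("2.9.0", cuda="12.8.1"):
    print(" ".join(command))
```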
