Implementation:Turboderp org Exllamav2 Build Wheels Release
| Knowledge Sources | |
|---|---|
| Domains | CI_CD, Build_System |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A comprehensive GitHub Actions CI/CD workflow that builds ExLlamaV2 Python wheel packages across 88 matrix configurations spanning multiple operating systems, Python versions, CUDA versions, ROCm versions, and PyTorch versions, with optional upload to GitHub Releases.
Description
This workflow, named "Build Wheels & Release", is the primary release pipeline for the ExLlamaV2 project. It uses a manually triggered workflow_dispatch event and defines an expansive build matrix to produce pre-compiled wheel packages for a wide range of platform and toolkit combinations.
Platform coverage:
- Ubuntu 22.04 (Linux) -- 49 wheel configurations: 40 CUDA builds, 8 ROCm builds, and an extra HF Spaces wheel
- Windows Server 2022 -- 38 configurations, all CUDA builds (ROCm is not built on Windows)
- 1 source distribution (sdist), also built on Ubuntu, with no GPU toolkit
Python version coverage:
- Python 3.10, 3.11, 3.12, and 3.13 -- though not every Python version is available for every PyTorch/CUDA combination (e.g., Python 3.13 support begins at torch 2.5.0)
CUDA version coverage:
- CUDA 11.8.0 -- paired with older PyTorch versions (2.3.1 through 2.6.0)
- CUDA 12.1.0 -- paired with PyTorch 2.3.1 through 2.5.0 (plus the extra HF Spaces wheel, which pairs it with PyTorch 2.2.2)
- CUDA 12.4.0 -- paired with PyTorch 2.6.0
- CUDA 12.8.1 -- paired with the latest PyTorch versions (2.7.0, 2.8.0, 2.9.0)
ROCm version coverage (Ubuntu only):
- ROCm 5.6 with PyTorch 2.2.2 (Python 3.10-3.11)
- ROCm 6.0 with PyTorch 2.3.1 (Python 3.10-3.12)
- ROCm 6.1 with PyTorch 2.4.0 (Python 3.10-3.12)
PyTorch version coverage:
- PyTorch 2.2.2 through PyTorch 2.9.0 (8 distinct versions total)
CUDA architecture targets:
- Older CUDA versions (11.8.0, 12.1.0, 12.4.0) target: 6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX (Pascal through Hopper)
- CUDA 12.8.1 targets: 6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX (adding data-center Blackwell at 10.0 and consumer Blackwell at 12.0)
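To make the split between the two target lists concrete, here is a minimal sketch that picks a TORCH_CUDA_ARCH_LIST value from a CUDA toolkit version. The helper arch_list_for_cuda and the CUDA_VERSION variable are hypothetical: in the workflow itself, each matrix entry carries its own cudaarch string rather than computing one.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a CUDA toolkit version to an architecture list,
# mirroring the two target lists described above.
arch_list_for_cuda() {
  local cuda="$1"
  local base="6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0"
  case "$cuda" in
    12.8.*)
      # CUDA 12.8 adds the Blackwell compute capabilities 10.0 and 12.0
      echo "$base 10.0 12.0+PTX" ;;
    *)
      # Older toolkits stop at 9.0; +PTX embeds PTX for forward compatibility
      echo "$base+PTX" ;;
  esac
}

# CUDA_VERSION is an assumed input; defaults to the newest toolkit in the matrix
export TORCH_CUDA_ARCH_LIST="$(arch_list_for_cuda "${CUDA_VERSION:-12.8.1}")"
echo "TORCH_CUDA_ARCH_LIST=$TORCH_CUDA_ARCH_LIST"
```

Keeping PTX only on the highest listed architecture is what lets newer, unlisted GPUs JIT-compile the kernels, at the cost of a slower first load.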
The build process follows a multi-stage pipeline for each matrix entry:
- Free Disk Space -- on Linux runners, reclaim disk space by removing the Android SDK, .NET, Haskell, and swap (uses jlumbroso/free-disk-space@v1.3.1)
- Checkout -- clone the repository via actions/checkout@v4
- Version Extraction -- parse exllamav2/version.py using a PowerShell regex to extract the __version__ string
- VS Build Tools -- on Windows, install VS2022 BuildTools 17.9.7 via Chocolatey for C++ compilation compatibility
- Python Setup -- install uv via astral-sh/setup-uv@v5 and configure the target Python version
- Toolkit Installation -- conditionally install either the ROCm SDK (via apt from repo.radeon.com), Windows CUDA (manual download and extraction of NVIDIA redistributable archives), or Linux CUDA (via Jimver/cuda-toolkit@v0.2.23)
- Dependency Installation -- install PyTorch from the appropriate index URL, plus build, setuptools==69.5.1, wheel, packaging, ninja, safetensors, tokenizers, and numpy
- Wheel Build -- run python -m build -n --wheel with build tags encoding the CUDA/ROCm version and PyTorch version (e.g., +cu128-torch2.9.0 or +rocm6.1-torch2.4.0). On Windows, the VS Developer Shell is spawned first and DISTUTILS_USE_SDK=1 is set.
- Release Upload -- if the release input is set to '1' and the version was parsed successfully, upload all .whl files to a GitHub Release tagged with the version (uses svenstaro/upload-release-action@2.6.1)
The workflow uses fail-fast: false so that a failure in one matrix entry does not cancel other builds. The default shell is PowerShell (pwsh) for cross-platform scripting consistency, with the ROCm build step explicitly using bash.
Usage
Use this workflow when preparing a full release of ExLlamaV2 that must support all historically maintained CUDA versions, ROCm versions, and PyTorch versions. This is the workflow to run when publishing a new version tag to GitHub Releases with maximum platform compatibility. For quick iteration on only the latest PyTorch version, prefer the companion Build Wheels Release Torch Latest workflow instead.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: .github/workflows/build-wheels-release.yml
- Lines: 1-442
Signature
```yaml
name: Build Wheels & Release

on:
  workflow_dispatch:
    inputs:
      release:
        description: 'Release? 1 = yes, 0 = no'
        default: '0'
        required: true
        type: string

permissions:
  contents: write

jobs:
  build_wheels:
    name: ${{ matrix.os }} P${{ matrix.pyver }} C${{ matrix.cuda }} R${{ matrix.rocm }} T${{ matrix.torch }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        include:
          # 88 configurations total:
          # Ubuntu 22.04 CUDA: Python 3.10-3.13 x CUDA 11.8/12.1/12.4/12.8 x Torch 2.3.1-2.9.0
          - { os: ubuntu-22.04, pyver: '3.10', cuda: '11.8.0', torch: '2.3.1', cudaarch: '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' }
          # ... (40 Ubuntu CUDA entries)
          # Windows 2022 CUDA: Python 3.10-3.13 x CUDA 11.8/12.1/12.4/12.8 x Torch 2.3.1-2.9.0
          - { os: windows-2022, pyver: '3.10', cuda: '11.8.0', torch: '2.3.1', cudaarch: '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' }
          # ... (38 Windows CUDA entries)
          # ROCm builds (Ubuntu only): ROCm 5.6/6.0/6.1 x Torch 2.2.2-2.4.0
          - { os: ubuntu-22.04, pyver: '3.10', rocm: '5.6', torch: '2.2.2' }
          # ... (8 ROCm entries)
          # sdist and extra HF Spaces wheel
          - { artname: 'sdist', os: ubuntu-22.04, pyver: '3.11', torch: '2.3.1' }
          - { os: ubuntu-22.04, pyver: '3.10', cuda: '12.1.0', torch: '2.2.2' }
      fail-fast: false
```
Import
```text
# Triggered via GitHub Actions workflow_dispatch
# No import needed - this is a CI/CD pipeline
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| release | string | Yes | Whether to upload wheels to a GitHub Release: '1' = yes, '0' = no (default: '0') |
Outputs
| Name | Type | Description |
|---|---|---|
| wheels | .whl files | Built Python wheel packages whose version carries the build tag as a PEP 440 local label; the hyphen in the tag is normalized to a dot in the filename (e.g., exllamav2-0.2.7+cu128.torch2.9.0-cp313-cp313-linux_x86_64.whl) |
| sdist | .tar.gz file | Source distribution built with EXLLAMA_NOCOMPILE=1 (no CUDA compilation) |
| release | GitHub Release | Uploaded wheels tagged as v{version} (only when release='1' and the version is parsed successfully) |
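Because PEP 440 normalizes "-" and "_" inside a local version label to ".", a wheel built with the +cu128-torch2.9.0 tag should end up named with +cu128.torch2.9.0, which keeps the filename's hyphen-separated fields unambiguous. The sketch below splits such a filename into the fields of the wheel naming convention ({name}-{version}-{python}-{abi}-{platform}.whl, optional build tag ignored). parse_wheel is a hypothetical helper, and 0.2.7 is just the example version from the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a wheel filename into its naming-convention fields.
parse_wheel() {
  local base="${1%.whl}"
  local name version py abi plat
  # Field order per the wheel naming convention (optional build tag omitted)
  IFS='-' read -r name version py abi plat <<< "$base"
  local public="${version%%+*}" label=""
  # Everything after "+" is the local label carrying the toolkit and torch tag
  [[ $version == *+* ]] && label="${version#*+}"
  printf 'name=%s version=%s local=%s python=%s abi=%s platform=%s\n' \
    "$name" "$public" "$label" "$py" "$abi" "$plat"
}

parse_wheel "exllamav2-0.2.7+cu128.torch2.9.0-cp313-cp313-linux_x86_64.whl"
# name=exllamav2 version=0.2.7 local=cu128.torch2.9.0 python=cp313 abi=cp313 platform=linux_x86_64
```

Feeding it the un-normalized form (+cu128-torch2.9.0) would misassign every field after the version, which is why the hyphen cannot survive into the filename.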
Matrix Parameters
| Parameter | Type | Description |
|---|---|---|
| artname | string | Artifact type: 'wheel' or 'sdist' |
| os | string | Runner OS: ubuntu-22.04 or windows-2022 |
| pyver | string | Python version: '3.10', '3.11', '3.12', or '3.13' |
| cuda | string | CUDA toolkit version: '11.8.0', '12.1.0', '12.4.0', or '12.8.1'; empty for ROCm and sdist entries |
| rocm | string | ROCm SDK version: '5.6', '6.0', or '6.1'; empty for CUDA and sdist entries |
| torch | string | PyTorch version: '2.2.2' through '2.9.0' |
| cudaarch | string | Space-separated CUDA architecture targets (e.g., '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX') |
Build Steps Detail
Step 1: Free Disk Space
On Linux runners, the jlumbroso/free-disk-space@v1.3.1 action removes the Android SDK, .NET SDK, Haskell toolchain, and swap storage to reclaim the disk space needed by the large CUDA compilation. The large-packages option is set to false to avoid removing system packages that may be needed.
Step 2: Version Extraction
A PowerShell script reads exllamav2/version.py and extracts the version using the regex pattern __version__ = "(\d+\.(?:\d+\.?(?:dev\d+)?)*)". The extracted version is stored as the PACKAGE_VERSION output and used later for the release tag.
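For readers more at home in Bash than PowerShell, here is a minimal equivalent of the extraction, assuming the same pattern translated to POSIX ERE (\d becomes [0-9], non-capturing (?:...) groups become plain groups). The function name extract_version is hypothetical; the workflow itself uses PowerShell.

```bash
#!/usr/bin/env bash
# Hypothetical Bash port of the version-extraction regex.
extract_version() {
  local file="$1"
  local re='__version__ = "([0-9]+\.([0-9]+\.?(dev[0-9]+)?)*)"'
  local line
  # "|| [[ -n $line ]]" also processes a final line without a trailing newline
  while IFS= read -r line || [[ -n $line ]]; do
    if [[ $line =~ $re ]]; then
      echo "${BASH_REMATCH[1]}"   # first capture group is the full version
      return 0
    fi
  done < "$file"
  return 1  # no match: the version stays empty, so the release upload is skipped
}
```

Usage would look like PACKAGE_VERSION="$(extract_version exllamav2/version.py)". The pattern accepts plain releases such as 0.3.2 as well as dev builds such as 0.2.7.dev1.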
Step 3: Platform Toolchain Setup
- Windows: VS2022 BuildTools 17.9.7 is installed via Chocolatey, pinned to a specific version for wider binary compatibility
- Windows CUDA: Individual CUDA redistributable archives (cudart, nvcc, nvrtc, cublas, nvtx, profiler_api, cccl, cusparse, cusolver, curand, cufft) are downloaded and extracted manually
- Linux CUDA: The Jimver/cuda-toolkit@v0.2.23 action installs CUDA via the network method
- Linux ROCm: The ROCm SDK is installed from the repo.radeon.com apt repository with GPG key verification
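The choice among these toolchain branches follows from which matrix parameters are set. The sketch below expresses that dispatch as a shell function; toolkit_for and its return labels are hypothetical names, not step names from the workflow file.

```bash
#!/usr/bin/env bash
# Hypothetical dispatch: which toolkit branch a matrix entry would take.
# Arguments: os, cuda, rocm (empty string when a parameter is unset).
toolkit_for() {
  local os="$1" cuda="$2" rocm="$3"
  if [[ -n $rocm ]]; then
    echo "rocm-apt"      # ROCm SDK from the repo.radeon.com apt repository
  elif [[ -n $cuda && $os == windows-* ]]; then
    echo "cuda-redist"   # manual download of NVIDIA redistributable archives
  elif [[ -n $cuda ]]; then
    echo "cuda-jimver"   # Jimver/cuda-toolkit network install on Linux
  else
    echo "none"          # sdist entry: no GPU toolkit
  fi
}

toolkit_for windows-2022 11.8.0 ""
```

Checking rocm first reflects the matrix design: ROCm entries leave cuda empty, and only Ubuntu entries ever set rocm.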
Step 4: Wheel Build
The wheel is built via python -m build -n --wheel with a build tag appended via egg_info. The tag formats are:
- CUDA: +cu{version}-torch{version} (e.g., +cu128-torch2.9.0, where the CUDA version keeps only major and minor, without the dot)
- ROCm: +rocm{version}-torch{version} (e.g., +rocm6.1-torch2.4.0)
The TORCH_CUDA_ARCH_LIST environment variable controls which GPU architectures are compiled.
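The tag construction can be sketched as below, assuming the CUDA part is derived by dropping the patch level and the dot (12.8.1 becomes cu128), as the examples suggest. build_tag and normalize_local are hypothetical helpers; normalize_local shows the PEP 440 rewrite of "-" and "_" to "." that the final wheel version undergoes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the egg_info tag from matrix values.
# Arguments: cuda, rocm (one of them empty), torch.
build_tag() {
  local cuda="$1" rocm="$2" torch="$3"
  if [[ -n $rocm ]]; then
    echo "+rocm${rocm}-torch${torch}"
  else
    local major minor
    # Keep major and minor only: 12.8.1 -> 12 and 8
    IFS='.' read -r major minor _ <<< "$cuda"
    echo "+cu${major}${minor}-torch${torch}"
  fi
}

# PEP 440 local-label normalization: "-" and "_" become "."
normalize_local() { echo "${1//[-_]/.}"; }

tag="$(build_tag 12.8.1 "" 2.9.0)"
echo "$tag -> $(normalize_local "$tag")"
```

Encoding the torch version in the tag matters because the compiled extension is tied to the PyTorch ABI it was built against, so users must match both toolkit and torch when picking a wheel.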
Step 5: Release Upload
When release='1', the svenstaro/upload-release-action@2.6.1 uploads all .whl files from ./dist/ to a GitHub Release. The release is tagged as v{PACKAGE_VERSION} with overwrite: true and file_glob: true.
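The upload condition and the file selection can be sketched in shell as follows. should_release, release_tag, and collect_wheels are hypothetical helpers; the glob picks only .whl files, matching the description that wheels (not the sdist tarball) are uploaded from ./dist/.

```bash
#!/usr/bin/env bash
# Hypothetical predicate: upload only when release is the string '1'
# and a version was actually parsed.
should_release() {
  local release="$1" version="$2"
  [[ $release == "1" && -n $version ]]
}

# Release tag format: v{PACKAGE_VERSION}
release_tag() { echo "v$1"; }

# List the .whl files in a directory, one per line (nothing if none match).
collect_wheels() {
  local dir="$1" f
  for f in "$dir"/*.whl; do
    [[ -e $f ]] && echo "$f"
  done
  return 0
}

if should_release "1" "0.3.2"; then
  echo "would upload to $(release_tag 0.3.2):"
  collect_wheels ./dist
fi
```

The version check guards against creating a release with an empty tag when the regex step fails to match.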
Usage Examples
Trigger Build Without Release
```bash
# Trigger a build-only run (no upload to GitHub Releases)
gh workflow run "Build Wheels & Release" --ref main -f release=0
```
Trigger Build With Release
```bash
# Trigger a full build and upload wheels to GitHub Releases
gh workflow run "Build Wheels & Release" --ref main -f release=1
```
Monitor Build Progress
```bash
# List recent workflow runs
gh run list --workflow="Build Wheels & Release" --limit 5
# Watch a specific run
gh run watch <run-id>
```
Dependencies
GitHub Actions Used
| Action | Version | Purpose |
|---|---|---|
| jlumbroso/free-disk-space | v1.3.1 | Reclaim disk space on Linux runners |
| actions/checkout | v4 | Clone the repository |
| astral-sh/setup-uv | v5 | Install the uv package manager and configure Python |
| Jimver/cuda-toolkit | v0.2.23 | Install the CUDA toolkit on Linux |
| svenstaro/upload-release-action | 2.6.1 | Upload wheel artifacts to GitHub Releases |
Python Build Dependencies
- torch (version from matrix)
- build
- setuptools==69.5.1 (pinned)
- wheel
- packaging
- ninja
- safetensors
- tokenizers
- numpy
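As an illustration of how these dependencies might be installed per matrix entry, the sketch below derives a PyTorch wheel index URL from the matrix values and prints uv install commands. It assumes the download.pytorch.org/whl/{cuXYZ|rocmX.Y} index layout and a CPU index for the sdist entry; torch_index_url and install_cmd are hypothetical, and the exact URL and flags used by the workflow may differ.

```bash
#!/usr/bin/env bash
# Hypothetical: map matrix values to a PyTorch wheel index URL.
# Arguments: cuda, rocm (empty string when unset).
torch_index_url() {
  local cuda="$1" rocm="$2"
  if [[ -n $rocm ]]; then
    echo "https://download.pytorch.org/whl/rocm${rocm}"
  elif [[ -n $cuda ]]; then
    local major minor
    IFS='.' read -r major minor _ <<< "$cuda"
    echo "https://download.pytorch.org/whl/cu${major}${minor}"
  else
    # Assumption: the sdist entry needs no GPU build of torch
    echo "https://download.pytorch.org/whl/cpu"
  fi
}

# Print (not run) the install commands for one matrix entry.
# Arguments: torch, cuda, rocm.
install_cmd() {
  local torch="$1" cuda="$2" rocm="$3"
  echo "uv pip install torch==${torch} --index-url $(torch_index_url "$cuda" "$rocm")"
  echo "uv pip install build setuptools==69.5.1 wheel packaging ninja safetensors tokenizers numpy"
}

install_cmd 2.9.0 12.8.1 ""
```

Installing torch in a separate command keeps its index URL from applying to the other, PyPI-hosted dependencies.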