Implementation:Turboderp org Exllamav2 Build Wheels Release

Knowledge Sources	Turboderp_org_Exllamav2
Domains	CI_CD, Build_System
Last Updated	2026-02-15 00:00 GMT

Overview

A GitHub Actions CI/CD workflow that builds ExLlamaV2 Python wheel packages across 88 matrix configurations (combinations of operating system, Python version, CUDA or ROCm version, and PyTorch version) and optionally uploads the results to GitHub Releases.

Description

This workflow, named "Build Wheels & Release", is the primary release pipeline for the ExLlamaV2 project. It uses a manually triggered workflow_dispatch event and defines an expansive build matrix to produce pre-compiled wheel packages for a wide range of platform and toolkit combinations.

Platform coverage:

Ubuntu 22.04 (Linux) -- 49 wheel configurations: 40 CUDA builds, 8 ROCm builds, and 1 extra HF Spaces wheel
Windows Server 2022 -- 38 configurations covering CUDA builds only (no ROCm support on Windows)
Ubuntu 22.04 (sdist) -- 1 source distribution built with no GPU toolkit, bringing the total to 88

Python version coverage:

Python 3.10, 3.11, 3.12, and 3.13 -- though not every Python version is available for every PyTorch/CUDA combination (e.g., Python 3.13 support begins at torch 2.5.0)

CUDA version coverage:

CUDA 11.8.0 -- paired with older PyTorch versions (2.3.1 through 2.6.0)
CUDA 12.1.0 -- paired with PyTorch 2.3.1 through 2.5.0 (plus PyTorch 2.2.2 for the extra HF Spaces wheel)
CUDA 12.4.0 -- paired with PyTorch 2.6.0
CUDA 12.8.1 -- paired with the latest PyTorch versions (2.7.0, 2.8.0, 2.9.0)

ROCm version coverage (Ubuntu only):

ROCm 5.6 with PyTorch 2.2.2 (Python 3.10-3.11)
ROCm 6.0 with PyTorch 2.3.1 (Python 3.10-3.12)
ROCm 6.1 with PyTorch 2.4.0 (Python 3.10-3.12)

PyTorch version coverage:

PyTorch 2.2.2 through PyTorch 2.9.0 (8 distinct versions total)

CUDA architecture targets:

Older CUDA versions target: 6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX (Pascal through Hopper)
CUDA 12.8.1 targets: 6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX (adding Blackwell data-center and consumer GPUs)
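
To make the architecture strings concrete, the sketch below parses a TORCH_CUDA_ARCH_LIST-style value and maps each compute capability to its NVIDIA GPU generation. The helper name describe_arch_list and the lookup table are illustrative and are not part of the workflow.

```python
# Illustrative helper (not part of the workflow): maps CUDA compute
# capabilities to NVIDIA GPU generation names.
GENERATIONS = {
    "6.0": "Pascal", "6.1": "Pascal",
    "7.0": "Volta", "7.5": "Turing",
    "8.0": "Ampere", "8.6": "Ampere",
    "8.9": "Ada Lovelace", "9.0": "Hopper",
    "10.0": "Blackwell", "12.0": "Blackwell",
}


def describe_arch_list(arch_list: str) -> list[tuple[str, str, bool]]:
    """Return (capability, generation, emits_ptx) for each space-separated entry."""
    result = []
    for token in arch_list.split():
        # A "+PTX" suffix asks the compiler to also embed PTX for that target,
        # so newer GPUs can JIT-compile the kernels.
        emits_ptx = token.endswith("+PTX")
        capability = token.removesuffix("+PTX")
        result.append((capability, GENERATIONS.get(capability, "unknown"), emits_ptx))
    return result


older = describe_arch_list("6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX")
newest = describe_arch_list("6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX")
print(older[-1])
print(newest[-1])
```

In both lists only the last entry carries +PTX, so forward compatibility relies on the PTX embedded for the highest listed capability.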

The build process follows a multi-stage pipeline for each matrix entry:

Free Disk Space -- on Linux runners, reclaim disk by removing Android SDK, .NET, Haskell, and swap (uses jlumbroso/free-disk-space@v1.3.1)
Checkout -- clone the repository via actions/checkout@v4
Version Extraction -- parse exllamav2/version.py using PowerShell regex to extract the __version__ string
VS Build Tools -- on Windows, install VS2022 BuildTools 17.9.7 via Chocolatey for C++ compilation compatibility
Python Setup -- install uv via astral-sh/setup-uv@v5 and configure the target Python version
Toolkit Installation -- conditionally install either ROCm SDK (via apt from repo.radeon.com), Windows CUDA (manual download and extraction of NVIDIA redistributable archives), or Linux CUDA (via Jimver/cuda-toolkit@v0.2.23)
Dependency Installation -- install PyTorch from the appropriate index URL, plus build, setuptools==69.5.1, wheel, packaging, ninja, safetensors, tokenizers, and numpy
Wheel Build -- run python -m build -n --wheel with build tags encoding the CUDA/ROCm version and PyTorch version (e.g., +cu128-torch2.9.0 or +rocm6.1-torch2.4.0). On Windows, the VS Developer Shell is spawned first and DISTUTILS_USE_SDK=1 is set.
Release Upload -- if the release input is set to '1' and the version was parsed successfully, upload all .whl files to a GitHub Release tagged with the version (uses svenstaro/upload-release-action@2.6.1)
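
The conditional parts of this pipeline can be summarized in a short sketch. It assumes a matrix entry is represented as a plain dict; the function plan_steps and the step labels are hypothetical names chosen for illustration, and the conditions simply restate the list above.

```python
def plan_steps(entry: dict, release: str = "0") -> list[str]:
    """Return the pipeline steps that would run for one matrix entry.

    Step labels are illustrative; the conditions follow the prose description.
    """
    is_linux = entry["os"].startswith("ubuntu")
    is_windows = entry["os"].startswith("windows")

    steps = []
    if is_linux:
        steps.append("free-disk-space")
    steps += ["checkout", "extract-version"]
    if is_windows:
        steps.append("install-vs-buildtools")
    steps.append("setup-python")

    # Exactly one toolkit step runs; the sdist entry has neither cuda nor rocm.
    if entry.get("rocm"):
        steps.append("install-rocm")
    elif entry.get("cuda") and is_windows:
        steps.append("install-cuda-windows")
    elif entry.get("cuda"):
        steps.append("install-cuda-linux")

    steps += ["install-dependencies", "build"]
    # The real workflow also requires a successfully parsed version here.
    if release == "1":
        steps.append("upload-release")
    return steps


windows_cuda = {"os": "windows-2022", "pyver": "3.10", "cuda": "11.8.0", "torch": "2.3.1"}
linux_rocm = {"os": "ubuntu-22.04", "pyver": "3.10", "rocm": "5.6", "torch": "2.2.2"}
print(plan_steps(windows_cuda))
print(plan_steps(linux_rocm, release="1"))
```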

The workflow uses fail-fast: false so that a failure in one matrix entry does not cancel other builds. The default shell is PowerShell (pwsh) for cross-platform scripting consistency, with the ROCm build step explicitly using bash.

Usage

Use this workflow when preparing a full release of ExLlamaV2 that must support all historically maintained CUDA versions, ROCm versions, and PyTorch versions. This is the workflow to run when publishing a new version tag to GitHub Releases with maximum platform compatibility. For quick iteration on only the latest PyTorch version, prefer the companion Build Wheels Release Torch Latest workflow instead.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: .github/workflows/build-wheels-release.yml
Lines: 1-442

Signature

name: Build Wheels & Release

on:
  workflow_dispatch:
    inputs:
      release:
        description: 'Release? 1 = yes, 0 = no'
        default: '0'
        required: true
        type: string

permissions:
  contents: write

jobs:
  build_wheels:
    name: ${{ matrix.os }} P${{ matrix.pyver }} C${{ matrix.cuda }} R${{ matrix.rocm }} T${{ matrix.torch }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        include:
          # 88 configurations total:
          # Ubuntu 22.04 CUDA: Python 3.10-3.13 x CUDA 11.8/12.1/12.4/12.8 x Torch 2.3.1-2.9.0
          - { os: ubuntu-22.04, pyver: '3.10', cuda: '11.8.0', torch: '2.3.1', cudaarch: '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' }
          # ... (40 Ubuntu CUDA entries)
          # Windows 2022 CUDA: Python 3.10-3.13 x CUDA 11.8/12.1/12.4/12.8 x Torch 2.3.1-2.9.0
          - { os: windows-2022, pyver: '3.10', cuda: '11.8.0', torch: '2.3.1', cudaarch: '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' }
          # ... (38 Windows CUDA entries)
          # ROCm builds (Ubuntu only): ROCm 5.6/6.0/6.1 x Torch 2.2.2-2.4.0
          - { os: ubuntu-22.04, pyver: '3.10', rocm: '5.6', torch: '2.2.2' }
          # ... (8 ROCm entries)
          # sdist and extra HF Spaces wheel
          - { artname: 'sdist', os: ubuntu-22.04, pyver: '3.11', torch: '2.3.1' }
          - { os: ubuntu-22.04, pyver: '3.10', cuda: '12.1.0', torch: '2.2.2' }
      fail-fast: false

Import

# Triggered via GitHub Actions workflow_dispatch
# No import needed - this is a CI/CD pipeline

I/O Contract

Inputs

Name	Type	Required	Description
release	string	Yes	Whether to upload wheels to GitHub release. 1 = yes, 0 = no (default: '0')

Outputs

Name	Type	Description
wheels	.whl files	Built Python wheel packages carrying the build tag as a local version (e.g., `exllamav2-0.2.7+cu128.torch2.9.0-cp313-cp313-linux_x86_64.whl`; the hyphen in the build tag is normalized to a dot in the file name)
sdist	.tar.gz file	Source distribution built with `EXLLAMA_NOCOMPILE=1` (no CUDA compilation)
release	GitHub Release	Uploaded wheels tagged as `v{version}` (only when release='1' and version is parsed successfully)

Matrix Parameters

Parameter	Type	Description
artname	string	Artifact type: 'wheel' or 'sdist'
os	string	Runner OS: `ubuntu-22.04` or `windows-2022`
pyver	string	Python version: '3.10', '3.11', '3.12', or '3.13'
cuda	string	CUDA toolkit version: '11.8.0', '12.1.0', '12.4.0', '12.8.1', or (empty for ROCm/sdist)
rocm	string	ROCm SDK version: '5.6', '6.0', '6.1', or (empty for CUDA/sdist)
torch	string	PyTorch version: '2.2.2' through '2.9.0'
cudaarch	string	Space-separated CUDA architecture targets (e.g., '6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX')
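
As an illustration of how these parameters combine, the sketch below renders the job name template from the Signature section for a given matrix entry. It assumes entries are plain dicts and relies on the fact that an undefined matrix key interpolates as an empty string in a GitHub Actions expression; job_name is a hypothetical helper, not part of the workflow.

```python
def job_name(entry: dict) -> str:
    """Render '<os> P<pyver> C<cuda> R<rocm> T<torch>' for one matrix entry."""
    def get(key: str) -> str:
        # Missing keys (e.g. rocm on a CUDA build) render as empty strings.
        return entry.get(key, "")

    return f"{get('os')} P{get('pyver')} C{get('cuda')} R{get('rocm')} T{get('torch')}"


cuda_entry = {"os": "ubuntu-22.04", "pyver": "3.10", "cuda": "11.8.0", "torch": "2.3.1"}
rocm_entry = {"os": "ubuntu-22.04", "pyver": "3.10", "rocm": "5.6", "torch": "2.2.2"}
print(job_name(cuda_entry))
print(job_name(rocm_entry))
```

The empty C or R field in the rendered name is a quick way to tell CUDA, ROCm, and sdist jobs apart in the Actions run list.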

Build Steps Detail

Step 1: Free Disk Space

On Linux runners, the jlumbroso/free-disk-space@v1.3.1 action removes the Android SDK, .NET SDK, Haskell toolchain, and swap storage to reclaim disk for the large CUDA compilation. The large-packages option is set to false to avoid removing system packages that may be needed.

Step 2: Version Extraction

A PowerShell script reads exllamav2/version.py and extracts the version using the regex pattern __version__ = "(\d+\.(?:\d+\.?(?:dev\d+)?)*)". The extracted version is stored as the PACKAGE_VERSION output and used later for the release tag.
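
The behavior of this pattern can be checked outside the workflow. The sketch below applies the same regex with Python's re module, which accepts it unchanged; the function name extract_version and the sample inputs are illustrative, and the workflow itself performs this step in PowerShell.

```python
import re

# Same pattern as quoted above, compiled with Python's re module.
VERSION_PATTERN = re.compile(r'__version__ = "(\d+\.(?:\d+\.?(?:dev\d+)?)*)"')


def extract_version(source: str) -> str:
    """Return the captured version string, or '' if the pattern does not match."""
    match = VERSION_PATTERN.search(source)
    return match.group(1) if match else ""


print(extract_version('__version__ = "0.2.7"'))
print(extract_version('__version__ = "0.2.8.dev1"'))
print(repr(extract_version('version = "0.2.7"')))
```

An empty result corresponds to the case where the version was not parsed successfully, in which the release upload is skipped.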

Step 3: Platform Toolchain Setup

Windows: VS2022 BuildTools 17.9.7 is installed via Chocolatey, pinned to a specific version for wider binary compatibility
Windows CUDA: Individual CUDA redistributable archives (cudart, nvcc, nvrtc, cublas, nvtx, profiler_api, cccl, cusparse, cusolver, curand, cufft) are downloaded and extracted manually
Linux CUDA: The Jimver/cuda-toolkit@v0.2.23 action installs CUDA via the network method
Linux ROCm: The ROCm SDK is installed from repo.radeon.com apt repository with GPG key verification

Step 4: Wheel Build

The wheel is built via python -m build -n --wheel with a build tag appended via egg_info. The tag format is:

CUDA: +cu{version}-torch{version} (e.g., +cu128-torch2.9.0)
ROCm: +rocm{version}-torch{version} (e.g., +rocm6.1-torch2.4.0)

The TORCH_CUDA_ARCH_LIST environment variable controls which GPU architectures are compiled.
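
A minimal sketch of how such a tag could be derived from the matrix values follows. It assumes the CUDA part is formed from the major and minor components of the toolkit version with the dot removed (12.8.1 becomes cu128), which matches the examples above; build_tag and build_env are hypothetical helpers, not the workflow's actual script.

```python
def build_tag(torch: str, cuda: str = "", rocm: str = "") -> str:
    """Compose the build tag for one matrix entry ('' for the sdist)."""
    if rocm:
        return f"+rocm{rocm}-torch{torch}"
    if cuda:
        # Assumption: only major and minor are kept, e.g. "12.8.1" -> "128".
        major, minor = cuda.split(".")[:2]
        return f"+cu{major}{minor}-torch{torch}"
    return ""


def build_env(cudaarch: str = "") -> dict:
    """Environment additions for the build; empty when no arch list is given."""
    return {"TORCH_CUDA_ARCH_LIST": cudaarch} if cudaarch else {}


print(build_tag("2.9.0", cuda="12.8.1"))
print(build_tag("2.4.0", rocm="6.1"))
print(build_env("6.0 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX"))
```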

Step 5: Release Upload

When release='1', the svenstaro/upload-release-action@2.6.1 uploads all .whl files from ./dist/ to a GitHub Release. The release is tagged as v{PACKAGE_VERSION} with overwrite: true and file_glob: true.
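
The upload decision and file selection can be sketched as below. The function release_plan and the returned dict layout are illustrative only; the real step is performed by the upload action, and the sample file names are placeholders.

```python
import tempfile
from pathlib import Path
from typing import Optional


def release_plan(release: str, version: str, dist_dir: Path) -> Optional[dict]:
    """Return what would be uploaded, or None when the upload is skipped."""
    # Upload only when requested and when a version was parsed.
    if release != "1" or not version:
        return None
    wheels = sorted(p.name for p in dist_dir.glob("*.whl"))
    return {"tag": f"v{version}", "files": wheels, "overwrite": True}


with tempfile.TemporaryDirectory() as tmp:
    dist = Path(tmp)
    (dist / "example-0.0.0-py3-none-any.whl").touch()  # placeholder wheel
    (dist / "notes.txt").touch()                        # ignored by the glob
    print(release_plan("1", "0.2.7", dist))
    print(release_plan("0", "0.2.7", dist))
    print(release_plan("1", "", dist))
```

Because overwrite is enabled, re-running the workflow for the same version replaces previously uploaded wheels of the same name rather than failing.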

Usage Examples

Trigger Build Without Release

# Trigger a build-only run (no upload to GitHub Releases)
gh workflow run "Build Wheels & Release" --ref main -f release=0

Trigger Build With Release

# Trigger a full build and upload wheels to GitHub Releases
gh workflow run "Build Wheels & Release" --ref main -f release=1

Monitor Build Progress

# List recent workflow runs
gh run list --workflow="Build Wheels & Release" --limit 5

# Watch a specific run
gh run watch <run-id>

Dependencies

GitHub Actions Used

Action	Version	Purpose
`jlumbroso/free-disk-space`	v1.3.1	Reclaim disk space on Linux runners
`actions/checkout`	v4	Clone the repository
`astral-sh/setup-uv`	v5	Install uv package manager and configure Python
`Jimver/cuda-toolkit`	v0.2.23	Install CUDA toolkit on Linux
`svenstaro/upload-release-action`	2.6.1	Upload wheel artifacts to GitHub Releases

Python Build Dependencies

torch (version from matrix)
build
setuptools==69.5.1 (pinned)
wheel
packaging
ninja
safetensors
tokenizers
numpy
