Implementation:Huggingface Diffusers Quantized Pipeline Call

From Leeroopedia

Metadata

| Property | Value |
| --- | --- |
| API | pipeline(prompt, **kwargs) with quantized models |
| Module | Pipeline-specific __call__ methods (e.g., src/diffusers/pipelines/flux/pipeline_flux.py) |
| Import | from diffusers import DiffusionPipeline |
| Type | Pattern Doc |
| Principle | Huggingface_Diffusers_Quantized_Inference |
| Implements | Principle:Huggingface_Diffusers_Quantized_Inference |

Purpose

This page documents the pattern for running inference with quantized models through the standard Diffusers pipeline API. The key design principle is transparency -- quantized models are called with the exact same API as non-quantized models. No special flags, modes, or parameters are needed at inference time. The quantization is handled internally by the quantized layers during their forward passes.

I/O Contract

The __call__ signature is identical for quantized and non-quantized pipelines. Using FluxPipeline as a representative example:

Input (representative)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| prompt | str or list[str] | None | Text prompt(s) for generation |
| height | int (optional) | model default | Output image height in pixels |
| width | int (optional) | model default | Output image width in pixels |
| num_inference_steps | int | 28 | Number of denoising steps |
| guidance_scale | float | 3.5 | Classifier-free guidance scale |
| generator | torch.Generator or list[torch.Generator] (optional) | None | Random number generator for reproducibility |
| output_type | str | "pil" | Output format, e.g. "pil", "np", "pt", or "latent" |

Output

| Return Type | Description |
| --- | --- |
| Pipeline-specific output (e.g., FluxPipelineOutput) | Contains generated images and optionally latents |

Execution Pattern

The pipeline __call__ executes the same three-phase flow regardless of quantization:

Phase 1: Text Encoding

# Text encoder forward pass -- quantized layers dequantize transparently
prompt_embeds, pooled_prompt_embeds, text_ids = self.encode_prompt(
    prompt=prompt,
    prompt_2=prompt_2,
    device=device,
    num_images_per_prompt=num_images_per_prompt,
    max_sequence_length=max_sequence_length,
)
# Output: standard float tensors in compute dtype (e.g., bfloat16)

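If the text encoder is quantized with a weight-only scheme (e.g., bitsandbytes NF4), each quantized linear layer dequantizes its weights to the compute dtype before the matrix multiplication. The output embeddings are standard floating-point tensors, so downstream pipeline code does not need to know whether the encoder was quantized.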
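The dequantize-then-multiply behaviour described above can be illustrated with a toy layer. This is a minimal pure-Python sketch, not the bitsandbytes or TorchAO implementation: `ToyQuantLinear`, `quantize_row`, and the per-row absmax int8 scheme are hypothetical names and choices made only for illustration.

```python
from typing import List, Tuple


def quantize_row(row: List[float]) -> Tuple[List[int], float]:
    """Absmax int8 quantization of one weight row (illustrative scheme)."""
    scale = max(abs(w) for w in row) / 127 or 1.0  # fall back to 1.0 for an all-zero row
    return [round(w / scale) for w in row], scale


class ToyQuantLinear:
    """Hypothetical stand-in for a weight-only quantized linear layer."""

    def __init__(self, weight: List[List[float]]) -> None:
        pairs = [quantize_row(row) for row in weight]
        self.q_weight = [q for q, _ in pairs]  # stored as small integers
        self.scales = [s for _, s in pairs]    # one float scale per output row

    def dequantize(self) -> List[List[float]]:
        # Recover approximate float weights; this runs on every forward call.
        return [[q * s for q in row] for row, s in zip(self.q_weight, self.scales)]

    def forward(self, x: List[float]) -> List[float]:
        weight = self.dequantize()
        # The matmul itself happens in ordinary floating point.
        return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, 3.0]

layer = ToyQuantLinear(weight)
y_quant = layer.forward(x)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
max_err = max(abs(a - b) for a, b in zip(y_quant, y_ref))

print("reference:", y_ref)
print("quantized:", y_quant)
print("max abs error:", max_err)
```

The caller only ever sees float inputs and float outputs. That is why the pipeline code around the layer does not change.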

Phase 2: Denoising Loop

# Iterative denoising -- transformer/UNet forward pass at each step
for i, t in enumerate(timesteps):
    # Broadcast the current scalar timestep across the batch
    timestep = t.expand(latents.shape[0]).to(latents.dtype)

    # The transformer's quantized layers handle dequantization per-call
    noise_pred = self.transformer(
        hidden_states=latents,
        timestep=timestep / 1000,
        encoder_hidden_states=prompt_embeds,
        # ...
    ).sample

    # Scheduler step operates on standard float tensors
    latents = self.scheduler.step(noise_pred, t, latents).prev_sample

This is the most memory-critical phase. The transformer/UNet is typically the largest model component. With quantization:

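  • Memory: Quantized weights occupy roughly 2x (8-bit) to 4x (4-bit) less GPU memory than float16, before accounting for quantization metadata such as scales
  • Compute: Each forward pass incurs dequantization overhead; for weight-only schemes the matmul itself still runs in the compute dtype
  • Iterations: Dequantization is repeated on every denoising step (e.g., 28 times with the default num_inference_steps for Flux)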
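The memory figures above follow from simple arithmetic on bits per weight. The sketch below works through it for an assumed parameter count of 12 billion, a number this page does not state. The helper `weight_bytes` is hypothetical and ignores quantization metadata (scales, zero points) as well as activations.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Raw storage for the weights alone, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8


# Assumption for illustration only: a transformer with 12 billion parameters.
NUM_PARAMS = 12_000_000_000

for label, bits in [("float16/bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = weight_bytes(NUM_PARAMS, bits) / 2**30
    print(f"{label:>17}: {gib:6.2f} GiB")

# Ratios relative to 16-bit storage
ratio_8bit = weight_bytes(NUM_PARAMS, 16) / weight_bytes(NUM_PARAMS, 8)
ratio_4bit = weight_bytes(NUM_PARAMS, 16) / weight_bytes(NUM_PARAMS, 4)
print("8-bit saving:", ratio_8bit, "x; 4-bit saving:", ratio_4bit, "x")
```

Real savings are somewhat smaller than these ratios. Backends store per-block scales, and some layers are typically kept in higher precision.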

Phase 3: VAE Decoding

# Decode latents to pixel space
latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
latents = (latents / self.vae.config.scaling_factor) + self.vae.config.shift_factor
image = self.vae.decode(latents, return_dict=False)[0]

If the VAE is quantized, its layers also dequantize transparently. However, VAE quantization can have a more noticeable impact on output quality than transformer quantization.
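The second line of the decoding snippet undoes a latent normalization before the VAE runs. The sketch below checks that arithmetic on plain floats, assuming the forward normalization is `(z - shift_factor) * scaling_factor`. The numeric values and the helper names `to_latent_space` and `to_vae_space` are illustrative assumptions; the real values come from `vae.config`.

```python
# Illustrative values only; read the real ones from pipe.vae.config.
SCALING_FACTOR = 0.3611
SHIFT_FACTOR = 0.1159


def to_latent_space(z: float) -> float:
    """Assumed forward normalization applied after VAE encoding."""
    return (z - SHIFT_FACTOR) * SCALING_FACTOR


def to_vae_space(latent: float) -> float:
    """Inverse used before decoding: (latents / scaling_factor) + shift_factor."""
    return latent / SCALING_FACTOR + SHIFT_FACTOR


samples = [-2.0, -0.5, 0.0, 0.1159, 1.75]
round_trip = [to_vae_space(to_latent_space(z)) for z in samples]
max_err = max(abs(a - b) for a, b in zip(samples, round_trip))
print("max round-trip error:", max_err)
```

This step is plain floating-point arithmetic on the latents, so it behaves the same whether or not any component is quantized.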

Usage Examples

Standard Quantized Inference

import torch
from diffusers import DiffusionPipeline, BitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig

# Load pipeline with 4-bit quantized transformer
pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
    }
)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Inference is identical to non-quantized pipelines
image = pipe(
    prompt="A photo of a cat wearing a top hat",
    num_inference_steps=28,
    guidance_scale=3.5,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("quantized_output.png")

Model-Level Quantized Inference

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

# Load quantized transformer directly
quantization_config = TorchAoConfig("int8wo")
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

# Assemble pipeline with quantized component
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Same inference API
image = pipe("A serene landscape at sunset", num_inference_steps=28).images[0]

Inspecting Quantization State

# After loading, check quantization status
print(pipe.transformer.is_quantized)       # True
print(pipe.transformer.quantization_method) # QuantizationMethod.TORCHAO
print(pipe.transformer.hf_quantizer)       # TorchAoHfQuantizer instance

# Check if a component is quantized
print(hasattr(pipe.vae, 'hf_quantizer'))   # False (VAE not quantized)
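Because these attributes may be absent on non-quantized components, reading them with `getattr` and a default avoids an `AttributeError`. The helper below is a hypothetical convenience, not part of Diffusers. It is exercised here with stand-in objects rather than real models so that it runs without a GPU or model download.

```python
from types import SimpleNamespace


def quantization_summary(component) -> dict:
    """Collect quantization attributes, tolerating components that lack them."""
    quantizer = getattr(component, "hf_quantizer", None)
    return {
        "is_quantized": bool(getattr(component, "is_quantized", False)),
        "method": getattr(component, "quantization_method", None),
        "quantizer": type(quantizer).__name__ if quantizer is not None else None,
    }


# Stand-ins for pipe.transformer and pipe.vae (assumed attribute layout).
fake_transformer = SimpleNamespace(
    is_quantized=True,
    quantization_method="torchao",
    hf_quantizer=object(),
)
fake_vae = SimpleNamespace()  # no quantization attributes at all

for name, component in [("transformer", fake_transformer), ("vae", fake_vae)]:
    print(name, quantization_summary(component))
```

With a real pipeline, the same function could be applied to each entry of `pipe.components` that is a model.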

Implementation Notes

  • No quantization-specific code in __call__: The pipeline's __call__ method has zero awareness of quantization. All quantization logic lives in the model layers (replaced during loading) and the quantizer (used only during loading/saving).
  • Device placement: Quantized models follow standard device placement. pipe.to("cuda") moves quantized weights to GPU in their quantized format. Some backends (BitsAndBytes) may enforce specific device placements via update_device_map.
  • torch.compile compatibility: TorchAO quantized models support torch.compile() for potential speedups. Other backends may have limited compilation support (check hf_quantizer.is_compileable).
  • Memory offloading compatibility: Quantized models can generally be combined with enable_model_cpu_offload() and enable_sequential_cpu_offload() for further memory savings, although the exact behavior depends on the quantization backend and library version.
  • Batch inference: Quantized models support batched inference with multiple prompts, identical to non-quantized pipelines.
  • Reproducibility: The same seed will generally produce different images from a quantized model and its non-quantized counterpart, because quantization changes the weights numerically. Within a single quantized configuration, a given seed gives consistent results across runs.
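The reproducibility note can be illustrated without any model. In this toy sketch the "model" is a dot product, `fake_quantize` is a hypothetical round-to-grid quantizer, and the seed controls only the input noise. It shows the pattern, not the numerical behaviour of any real backend.

```python
import random
from typing import List


def make_noise(seed: int, n: int = 4) -> List[float]:
    """Seeded noise, playing the role of the initial latents."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def fake_quantize(weights: List[float], levels: int = 15) -> List[float]:
    """Snap weights to a coarse symmetric grid (illustrative, not a real scheme)."""
    scale = max(abs(w) for w in weights) / (levels // 2)
    return [round(w / scale) * scale for w in weights]


def toy_model(weights: List[float], noise: List[float]) -> float:
    return sum(w * x for w, x in zip(weights, noise))


weights = [0.31, -0.72, 0.05, 0.98]
q_weights = fake_quantize(weights)

run_a = toy_model(q_weights, make_noise(42))  # quantized, seed 42
run_b = toy_model(q_weights, make_noise(42))  # quantized, seed 42 again
full = toy_model(weights, make_noise(42))     # full precision, seed 42

print("quantized runs identical:", run_a == run_b)
print("matches full precision:  ", run_a == full)
```

The seed fixes the noise and quantization perturbs the weights. Repeated quantized runs therefore agree with each other but not with the full-precision run.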

Related Pages

Requires Environment

Source References

  • src/diffusers/pipelines/flux/pipeline_flux.py:L654-L683 - FluxPipeline.__call__ signature (representative)
  • src/diffusers/quantizers/base.py:L243-L245 - is_compileable property
  • src/diffusers/quantizers/base.py:L200-L210 - dequantize method for model recovery
