Environment:PeterL1n BackgroundMattingV2 PyTorch CUDA

From Leeroopedia


Knowledge Sources
Domains: Computer_Vision, Deep_Learning
Last Updated: 2026-02-09 02:00 GMT

Overview

Linux environment with a CUDA-capable GPU, Python 3, PyTorch 1.7+, torchvision 0.8+, and kornia 0.4+ for background matting training and inference.

Description

This environment provides the GPU-accelerated runtime required for all BackgroundMattingV2 operations: training, inference, model export, and speed testing. The stack is built on PyTorch with CUDA support and includes two computer vision libraries: kornia for differentiable augmentations and loss computation, and OpenCV for video I/O. Training uses mixed precision via `torch.cuda.amp` (`autocast` and `GradScaler`) and, optionally, distributed data-parallel training over the NCCL backend. ONNX export and validation require `onnxruntime` as an additional optional dependency.

Usage

Use this environment for all BackgroundMattingV2 workflows: base model training, refinement model training, image/video/webcam inference, TorchScript export, ONNX export, and speed benchmarking. The implementations in this repository place models and tensors on the GPU with `.cuda()` or `.to(device)`, so a CUDA-capable GPU is required for standard operation. The image and video inference scripts also accept `--device cpu`, but CPU inference is too slow for real-time use.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Webcam plugin only works on Linux; Windows needs `--num-workers 0` |
| Hardware | NVIDIA GPU | RTX 2080 Ti achieves 4K@30fps and HD@60fps per the paper |
| VRAM | 8GB minimum | 16GB+ recommended for training at 2048x2048 resolution |
| Disk | 10GB+ SSD | For datasets, checkpoints, and TensorBoard logs |

Dependencies

System Packages

  • CUDA toolkit (compatible with PyTorch 1.7+)
  • `ffmpeg` (for video encoding/decoding)
  • `v4l2loopback` (optional, for virtual webcam on Linux)

Python Packages

  • `torch` == 1.7.0
  • `torchvision` == 0.8.1
  • `kornia` == 0.4.1
  • `tensorboard` == 2.3.0
  • `tqdm` == 4.51.0
  • `opencv-python` == 4.4.0.44
  • `onnxruntime` == 1.6.0 (optional, for ONNX export validation)
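
To confirm that an existing environment matches the pins above, a check along the following lines may help. This is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); `PINNED`, `installed_version`, and `check_pins` are illustrative names and are not part of the repository.

```python
from importlib import metadata

# Pins copied from the list above; the optional onnxruntime pin is omitted.
PINNED = {
    "torch": "1.7.0",
    "torchvision": "0.8.1",
    "kornia": "0.4.1",
    "tensorboard": "2.3.0",
    "tqdm": "4.51.0",
    "opencv-python": "4.4.0.44",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pinned, lookup=installed_version):
    """Map each mismatched package to (expected, found); an empty dict means all match."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Ignore local build suffixes such as "+cu110" when comparing versions.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without the packages installed; in normal use the default is sufficient.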

Credentials

No credentials or API keys are required. All data and model weights are loaded from local filesystem paths configured in `data_path.py`.

The distributed training backend reads the `MASTER_ADDR` and `MASTER_PORT` environment variables, but the training script sets these automatically (`localhost` with a random port in the range 12300-12399).
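
The rendezvous setup described above can be sketched as follows. The helper name `set_rendezvous_env` is hypothetical, and the choice to keep any value already present in the environment is an assumption made for this example; the training script itself may simply overwrite both variables.

```python
import os
import random


def set_rendezvous_env(addr="localhost", port_range=(12300, 12399), rng=random):
    """Set MASTER_ADDR / MASTER_PORT unless they are already defined.

    Returns the (addr, port) pair that ends up in the environment.
    """
    # randint is inclusive on both ends; environment values must be strings.
    port = str(rng.randint(*port_range))
    # Assumption: setdefault keeps any value exported by the user or a scheduler.
    addr = os.environ.setdefault("MASTER_ADDR", addr)
    port = os.environ.setdefault("MASTER_PORT", port)
    return addr, port


if __name__ == "__main__":
    print(set_rendezvous_env())
```

Respecting pre-set values makes it possible to pin a known-free port when the randomly chosen one collides with another process on the same machine.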

Quick Install

pip install torch==1.7.0 torchvision==0.8.1 kornia==0.4.1 tensorboard==2.3.0 tqdm==4.51.0 opencv-python==4.4.0.44

# For ONNX export validation (optional)
pip install onnxruntime==1.6.0

Code Evidence

CUDA usage in training from `train_base.py:121`:

model = MattingBase(args.model_backbone).cuda()

Mixed-precision training setup from `train_base.py:25-26,133,184,188-190`:

from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

with autocast():
    pred_pha, pred_fgr, pred_err = model(true_src, true_bgr)[:3]
    loss = compute_loss(pred_pha, pred_fgr, pred_err, true_pha, true_fgr)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Distributed training with NCCL from `train_refine.py:74-75,83-85`:

distributed_num_gpus = torch.cuda.device_count()
assert args.batch_size % distributed_num_gpus == 0

os.environ['MASTER_ADDR'] = addr
os.environ['MASTER_PORT'] = port
dist.init_process_group("nccl", rank=rank, world_size=distributed_num_gpus)

SyncBatchNorm and DistributedDataParallel from `train_refine.py:147-148`:

model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model_distributed = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

Device selection for inference from `inference_video.py:62,115`:

parser.add_argument('--device', type=str, choices=['cpu', 'cuda'], default='cuda')
device = torch.device(args.device)

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Insufficient VRAM for training resolution | Reduce `--batch-size` or use lower resolution crops |
| `RuntimeError: Expected all tensors to be on the same device` | Mixed CPU/GPU tensors | Ensure all inputs are moved to the same device with `.cuda()` or `.to(device)` |
| `AssertionError: batch_size % distributed_num_gpus == 0` | Batch size not divisible by GPU count | Set `--batch-size` to a multiple of available GPUs |
| `dist.init_process_group` failures | NCCL backend not available | Install PyTorch with CUDA support; ensure NCCL is available |
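
For the batch-size assertion in particular, the divisibility rule can be checked before launching a run. The sketch below uses hypothetical helper names (`nearest_valid_batch_sizes`, `split_batch`) and assumes the global batch is divided evenly across GPUs; it does not reproduce code from the training scripts.

```python
def nearest_valid_batch_sizes(batch_size, num_gpus):
    """Return the closest multiples of num_gpus at or below and at or above batch_size.

    The lower bound is clamped to num_gpus so that it is never zero.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    lower = (batch_size // num_gpus) * num_gpus
    upper = lower if lower == batch_size else lower + num_gpus
    return max(lower, num_gpus), max(upper, num_gpus)


def split_batch(batch_size, num_gpus):
    """Return the per-GPU batch size, or raise with a suggestion if the split is uneven."""
    lower, upper = nearest_valid_batch_sizes(batch_size, num_gpus)
    if batch_size % num_gpus != 0 or batch_size < num_gpus:
        raise ValueError(
            f"batch size {batch_size} is not a positive multiple of {num_gpus} GPUs; "
            f"try {lower} or {upper}"
        )
    return batch_size // num_gpus


if __name__ == "__main__":
    print(split_batch(8, 4))                 # even split across 4 GPUs
    print(nearest_valid_batch_sizes(10, 4))  # candidates around an invalid value
```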

Compatibility Notes

  • Windows: The `inference_images.py` script documents that Windows requires `--num-workers 0`, which makes the DataLoader load data in the main process instead of spawning worker processes, due to multiprocessing limitations.
  • CPU inference: Image and video inference scripts support `--device cpu` but performance is not practical for real-time applications.
  • Webcam plugin: The virtual camera feature only works on Linux with `v4l2loopback`.
  • ONNX Runtime: The authors note ONNX inference is "much slower than PyTorch/TorchScript" and recommend PyTorch or TorchScript for production use.
  • Real-time performance: The paper reports 4K@30fps and HD@60fps on an RTX 2080 Ti GPU. The provided video scripts are not real-time due to software video encoding overhead.
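
The Windows and CPU notes above can be folded into argument defaults. The following is a sketch only: the `--device` and `--num-workers` flags come from the scripts cited on this page, while `build_parser`, the platform-dependent default, and the non-Windows default of 4 workers are assumptions made for illustration.

```python
import argparse
import platform


def build_parser(system=None):
    """Build an inference-style parser; `system` defaults to platform.system()."""
    system = system or platform.system()
    # Assumption: default to 0 workers on Windows, and an arbitrary 4 elsewhere.
    default_workers = 0 if system == "Windows" else 4
    parser = argparse.ArgumentParser(description="Illustrative inference options")
    parser.add_argument("--device", type=str, choices=["cpu", "cuda"], default="cuda")
    parser.add_argument("--num-workers", type=int, default=default_workers)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.device, args.num_workers)
```

An explicit `--num-workers` value on the command line still overrides the platform-dependent default.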
