Environment:PeterL1n BackgroundMattingV2 PyTorch CUDA
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Deep_Learning |
| Last Updated | 2026-02-09 02:00 GMT |
Overview
Linux environment with CUDA-capable GPU, Python 3, PyTorch 1.7+, torchvision 0.8+, and kornia 0.4+ for background matting training and inference.
Description
This environment provides the GPU-accelerated runtime required for all BackgroundMattingV2 operations: training, inference, model export, and speed testing. The stack is built on PyTorch with CUDA support and includes computer vision libraries: kornia for differentiable augmentations and loss computation, and OpenCV for video I/O. Training uses mixed precision via `torch.cuda.amp` (`autocast` and `GradScaler`) and, optionally, distributed data-parallel training over the NCCL backend. ONNX export and validation use `onnxruntime` as an additional, optional dependency.
Usage
Use this environment for all BackgroundMattingV2 workflows: base model training, refinement model training, image/video/webcam inference, TorchScript export, ONNX export, and speed benchmarking. Every implementation in this repository moves the model and tensors to a device with `.cuda()` or `.to(device)`, so a CUDA-capable GPU is expected for standard operation. The image and video inference scripts also accept `--device cpu`, but CPU inference is too slow for real-time use.
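As a minimal sketch of the device choice described above, the helper below resolves a requested device to one that is actually usable. `resolve_device` is a hypothetical helper and is not part of the repository, whose scripts pass `--device` straight to `torch.device`. CUDA availability is passed in as a flag so the logic can be read and run without a GPU.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Return the device string to use, falling back to CPU when CUDA is missing.

    Hypothetical helper for illustration; not part of BackgroundMattingV2.
    """
    if requested not in ("cpu", "cuda"):
        raise ValueError(f"unsupported device: {requested!r}")
    if requested == "cuda" and not cuda_available:
        # Assumption: a silent CPU fallback is acceptable for offline inference.
        return "cpu"
    return requested
```

With PyTorch installed, this could be called as `torch.device(resolve_device(args.device, torch.cuda.is_available()))`.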
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | Webcam plugin works only on Linux; on Windows, run with `--num-workers 0` |
| Hardware | NVIDIA GPU | The paper reports 4K@30fps and HD@60fps on an RTX 2080 Ti |
| VRAM | 8GB minimum | 16GB+ recommended for training at 2048x2048 resolution |
| Disk | 10GB+ SSD | For datasets, checkpoints, and TensorBoard logs |
Dependencies
System Packages
- CUDA toolkit (compatible with PyTorch 1.7+)
- `ffmpeg` (for video encoding/decoding)
- `v4l2loopback` (optional, for virtual webcam on Linux)
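The system packages above can be probed before running anything heavier. The following is a rough sketch using only the standard library. It assumes that `ffmpeg` is discoverable on `PATH` and that a loaded `v4l2loopback` kernel module appears under `/sys/module` (Linux only). `check_system_tools` is a hypothetical name, and the CUDA toolkit is deliberately not checked here.

```python
import os
import shutil


def check_system_tools() -> dict:
    """Report which optional system tools appear to be present.

    Assumptions: ffmpeg is on PATH, and a loaded v4l2loopback kernel
    module shows up as a directory under /sys/module (Linux only).
    """
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "v4l2loopback": os.path.isdir("/sys/module/v4l2loopback"),
    }


if __name__ == "__main__":
    for tool, present in check_system_tools().items():
        print(f"{tool}: {'found' if present else 'missing'}")
```

A `missing` result for `v4l2loopback` matters only for the virtual webcam workflow.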
Python Packages
- `torch` == 1.7.0
- `torchvision` == 0.8.1
- `kornia` == 0.4.1
- `tensorboard` == 2.3.0
- `tqdm` == 4.51.0
- `opencv-python` == 4.4.0.44
- `onnxruntime` == 1.6.0 (optional, for ONNX export validation)
Credentials
No credentials or API keys are required. All data and model weights are loaded from local filesystem paths configured in `data_path.py`.
The distributed training backend reads the `MASTER_ADDR` and `MASTER_PORT` environment variables, but the training script sets them automatically (`localhost` and a random port in the range 12300-12399).
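The rendezvous setup described above can be sketched with the standard library alone. The helper name `configure_rendezvous` and its signature are hypothetical. Only the two environment variable names and the port range come from this document.

```python
import os
import random


def configure_rendezvous(addr: str = "localhost") -> tuple:
    """Set MASTER_ADDR and MASTER_PORT for a single-node run.

    Sketch only: the port range 12300-12399 is taken from the description
    above; the helper itself is not part of the repository.
    """
    port = str(random.randint(12300, 12399))  # randint includes both ends
    os.environ["MASTER_ADDR"] = addr
    os.environ["MASTER_PORT"] = port
    return addr, port
```

In a multi-process launch, the port should be drawn once in the parent process and handed to every worker, since all ranks have to agree on the same address and port.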
Quick Install
pip install torch==1.7.0 torchvision==0.8.1 kornia==0.4.1 tensorboard==2.3.0 tqdm==4.51.0 opencv-python==4.4.0.44
# For ONNX export validation (optional)
pip install onnxruntime==1.6.0
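After installation, the pinned versions can be compared against what is actually installed. The sketch below uses `importlib.metadata`, which assumes Python 3.8 or newer. `PINNED`, `matches`, and `check_pins` are illustrative names. The comparison ignores a local version suffix such as `+cu110`, which some CUDA wheels of PyTorch report.

```python
from importlib import metadata

# Pins copied from the package list in this document.
PINNED = {
    "torch": "1.7.0",
    "torchvision": "0.8.1",
    "kornia": "0.4.1",
    "tensorboard": "2.3.0",
    "tqdm": "4.51.0",
    "opencv-python": "4.4.0.44",
}


def matches(expected: str, installed) -> bool:
    """True if installed equals expected, ignoring a '+local' suffix."""
    return installed is not None and installed.split("+")[0] == expected


def check_pins(pins: dict) -> dict:
    """Map each package to (expected, installed); installed is None if absent."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        status = "ok" if matches(expected, installed) else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```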
Code Evidence
CUDA usage in training from `train_base.py:121`:
model = MattingBase(args.model_backbone).cuda()
Mixed-precision training setup from `train_base.py:25-26,133,184,188-190`:
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast():
    pred_pha, pred_fgr, pred_err = model(true_src, true_bgr)[:3]
    loss = compute_loss(pred_pha, pred_fgr, pred_err, true_pha, true_fgr)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Distributed training with NCCL from `train_refine.py:74-75,83-85`:
distributed_num_gpus = torch.cuda.device_count()
assert args.batch_size % distributed_num_gpus == 0
os.environ['MASTER_ADDR'] = addr
os.environ['MASTER_PORT'] = port
dist.init_process_group("nccl", rank=rank, world_size=distributed_num_gpus)
SyncBatchNorm and DistributedDataParallel from `train_refine.py:147-148`:
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model_distributed = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
Device selection for inference from `inference_video.py:62,115`:
parser.add_argument('--device', type=str, choices=['cpu', 'cuda'], default='cuda')
device = torch.device(args.device)
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Insufficient VRAM for training resolution | Reduce `--batch-size` or use lower resolution crops |
| `RuntimeError: Expected all tensors to be on the same device` | Mixed CPU/GPU tensors | Ensure all inputs are moved to the same device with `.cuda()` or `.to(device)` |
| `AssertionError` raised at `assert args.batch_size % distributed_num_gpus == 0` | Batch size not divisible by GPU count | Set `--batch-size` to a multiple of the number of available GPUs |
| `dist.init_process_group` fails | NCCL backend not available in the installed PyTorch build | Install a CUDA-enabled PyTorch build on Linux and confirm that `torch.distributed.is_nccl_available()` returns `True` |
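The batch-size error in the table can be caught before any process is spawned. The function below is a hypothetical preflight check that mirrors the divisibility assertion quoted from `train_refine.py`. It assumes the global batch is split evenly across GPUs, and it takes the GPU count as a plain integer so it runs without PyTorch.

```python
def per_gpu_batch_size(batch_size: int, num_gpus: int) -> int:
    """Return the per-GPU batch size, or raise if the split is uneven.

    Hypothetical illustration; assumes an even split of the global batch.
    """
    if num_gpus < 1:
        raise ValueError("at least one GPU is required for distributed training")
    if batch_size % num_gpus != 0:
        lower = (batch_size // num_gpus) * num_gpus
        # Suggest the closest multiples of num_gpus, skipping zero.
        candidates = [v for v in (lower, lower + num_gpus) if v > 0]
        raise ValueError(
            f"batch size {batch_size} is not divisible by {num_gpus} GPUs; "
            f"nearest valid values: {candidates}"
        )
    return batch_size // num_gpus
```

For example, `per_gpu_batch_size(8, 4)` returns `2`, while `per_gpu_batch_size(6, 4)` raises an error suggesting 4 or 8.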
Compatibility Notes
- Windows: The `inference_images.py` script documents that Windows requires `--num-workers 0` (single-threaded DataLoader) due to multiprocessing limitations.
- CPU inference: Image and video inference scripts support `--device cpu` but performance is not practical for real-time applications.
- Webcam plugin: The virtual camera feature only works on Linux with `v4l2loopback`.
- ONNX Runtime: The authors note ONNX inference is "much slower than PyTorch/TorchScript" and recommend PyTorch or TorchScript for production use.
- Real-time performance: The paper reports 4K@30fps and HD@60fps on an RTX 2080 Ti GPU. The provided video scripts are not real-time due to software video encoding overhead.
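To put the reported frame rates in perspective, the arithmetic below converts them into a per-frame time budget and a pixel throughput. It assumes that "4K" means 3840x2160 and "HD" means 1920x1080. The function names are illustrative only.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def pixel_rate(width: int, height: int, fps: float) -> float:
    """Pixels that must be processed per second at a resolution and rate."""
    return width * height * fps


# Assumption: "4K" is 3840x2160 and "HD" is 1920x1080.
targets = {"4K@30fps": (3840, 2160, 30), "HD@60fps": (1920, 1080, 60)}
for label, (w, h, fps) in targets.items():
    print(f"{label}: {frame_budget_ms(fps):.1f} ms/frame, "
          f"{pixel_rate(w, h, fps) / 1e6:.1f} Mpx/s")
# 4K@30fps: 33.3 ms/frame, 248.8 Mpx/s
# HD@60fps: 16.7 ms/frame, 124.4 Mpx/s
```

Under these assumptions, any per-frame overhead outside the model, such as software video encoding, has to fit within the same 33.3 ms or 16.7 ms budget for a pipeline to remain real-time.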