
Workflow:PeterL1n BackgroundMattingV2 Video matting inference

From Leeroopedia


Knowledge Sources

  • Domains: Computer_Vision, Image_Matting, Video_Processing, Inference
  • Last updated: 2026-02-09 02:30 GMT

Overview

A video matting pipeline that processes a video frame by frame against a static background image, producing alpha mattes, foreground layers, and composites as MP4 video or image-sequence output.

Description

This workflow performs per-frame background matting on a video. The input consists of a source video (containing a subject in front of a known static background) and a single background image (the same scene without the subject). Each frame is independently processed through the matting model to predict alpha and foreground, which can be composited onto a target background (either a provided video or a default solid color).

The pipeline supports two output formats: MP4 video files (using OpenCV VideoWriter with mp4v codec) or image sequences (numbered frames saved as individual files). Both MattingBase and MattingRefine models are supported. Optional homographic alignment corrects minor camera shifts between the video and background capture.

Usage

Execute this workflow when you have a video shot against a known static background and need to extract the subject for compositing onto a different background. Typical use cases include virtual background replacement for pre-recorded video, visual effects compositing, green-screen-free video production where a clean plate was captured before or after filming, and batch video processing for content creation.

Execution Steps

Step 1: Model loading

Instantiate the matting model (MattingBase or MattingRefine) with the desired backbone and refinement configuration. Load trained checkpoint weights and set the model to evaluation mode on the target device. Configuration choices affect the quality-computation trade-off for each frame.

Key considerations:

  • For HD video (1080p): backbone_scale=0.25, refine_sample_pixels=80000
  • For 4K video: backbone_scale=0.125, refine_sample_pixels=320000
  • MobileNetV2 backbone is faster but produces slightly lower quality than ResNet50/101
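The resolution-dependent defaults above can be captured in a small helper. This is a sketch, not part of the repository; the function name is illustrative and it simply encodes the HD/4K settings listed above:

```python
def refine_config(height: int) -> dict:
    """Pick MattingRefine settings from the source video height.

    HD (1080p and below) uses a larger backbone scale with a smaller
    refinement sample budget; 4K halves the backbone scale and
    quadruples the sample budget.
    """
    if height <= 1080:  # HD video
        return {"backbone_scale": 0.25, "refine_sample_pixels": 80_000}
    # 4K and above
    return {"backbone_scale": 0.125, "refine_sample_pixels": 320_000}
```

The returned dict can be splatted into the model constructor alongside the backbone choice.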

Step 2: Video and background loading

Load the source video as a frame-iterable dataset using the VideoDataset class (backed by the pims library for frame reading). Load the static background as a single PIL image that will be broadcast to pair with every video frame. Optionally apply resolution resizing and homographic alignment. If a target background video is provided for compositing, load it as a second VideoDataset.

Key considerations:

  • The background is a single image that pairs with every frame
  • Video frames and background can be optionally resized to a target resolution
  • Homographic alignment corrects camera drift between video and background capture
  • A target background video enables frame-by-frame compositing onto dynamic backgrounds
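The broadcast pairing of one background with every frame can be sketched as a tiny dataset wrapper. The class name is illustrative, not the repository's actual class; it only shows the indexing behavior:

```python
class PairWithBackground:
    """Pair every video frame with a single static background image.

    Sketch of the broadcast behavior: __getitem__ returns the i-th
    frame together with the same background object each time, so a
    DataLoader sees (frame, background) pairs without duplicating
    the background in memory.
    """

    def __init__(self, frames, background):
        self.frames = frames          # frame-iterable source video
        self.background = background  # single static background image

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, i):
        return self.frames[i], self.background
```

A real implementation would yield tensors and optionally apply resizing and homographic alignment per pair.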

Step 3: Output writer initialization

Create output writers based on the chosen format (video or image sequence). For video output, initialize OpenCV VideoWriter instances with the source video's frame rate and resolution for each output type. For image sequence output, create numbered file writers that save individual frames as PNG (composites) or JPEG (other outputs) using threaded async writing.

Output types:

  • com - Composite (subject composited onto the target background for video output; RGBA for image sequences)
  • pha - Alpha matte (grayscale)
  • fgr - Foreground RGB
  • err - Error prediction map (upsampled to full resolution)
  • ref - Refinement region map (upsampled to full resolution)
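The per-type file layout can be sketched as follows. The helper name and exact path scheme are assumptions for illustration, not the script's verbatim output layout; they reflect the rules above (one MP4 per type in video mode; PNG for composites and JPEG for other outputs in image-sequence mode):

```python
def output_path(out_dir: str, kind: str, fmt: str, index: int = None) -> str:
    """Build the output file path for one output type ('com', 'pha',
    'fgr', 'err', 'ref') under the chosen format."""
    if fmt == "video":
        # One mp4v-encoded file per output type.
        return f"{out_dir}/{kind}.mp4"
    # Image sequence: composites keep alpha, so they go to PNG;
    # everything else is saved as JPEG, one numbered file per frame.
    ext = "png" if kind == "com" else "jpg"
    return f"{out_dir}/{kind}/{index:04d}.{ext}"
```

In video mode each writer is an OpenCV VideoWriter opened with the source video's frame rate and resolution.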

Step 4: Frame-by-frame inference

Iterate through video frames with a DataLoader at batch size 1. For each frame, transfer the source-background pair to the GPU (or target device) and run the matting model in no-gradient mode. Route each output tensor to the appropriate writer. For video compositing, blend the predicted foreground and alpha with the target background using the standard compositing equation.

What happens:

  • Each frame is processed independently (no temporal modeling)
  • Compositing equation: com = fgr * pha + target_bgr * (1 - pha)
  • Default target background is a solid green color (RGB: 120/255, 255/255, 155/255)
  • Writers handle format conversion (tensor to numpy/PIL) and encoding
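The compositing equation can be checked on a single pixel with plain Python floats. The helper name and per-pixel tuple representation are illustrative (the real pipeline operates on whole tensors):

```python
def composite(fgr, pha, bgr):
    """Standard over-compositing: com = fgr * pha + bgr * (1 - pha).

    fgr and bgr are per-channel floats in [0, 1]; pha is a scalar
    alpha in [0, 1] for this pixel.
    """
    return tuple(f * pha + b * (1.0 - pha) for f, b in zip(fgr, bgr))

# Default target background: solid green.
GREEN = (120 / 255, 255 / 255, 155 / 255)
```

At pha = 1 the output is pure foreground, at pha = 0 pure background, and intermediate alphas blend linearly.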

Step 5: Output finalization

After all frames are processed, video writers flush their buffers and close the output files. Image sequence writers complete any remaining async write operations. The output directory contains separate files or subdirectories for each requested output type.

Key considerations:

  • Video encoding uses mp4v codec via OpenCV (software encoding, not hardware-accelerated)
  • For production use, additional engineering for hardware encoding and parallel frame loading is recommended
  • The script is not designed for real-time operation; see the inference_speed_test script for throughput measurement
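The threaded async writing and its drain-on-close finalization can be sketched with a stdlib queue and a worker thread. This is a simplified stand-in for the image-sequence writer, not its actual implementation; writing to disk is replaced by appending to a list:

```python
import queue
import threading

class AsyncWriter:
    """Queue frames and write them on a background thread; close()
    drains the queue so no pending frame is lost."""

    def __init__(self):
        self.q = queue.Queue()
        self.written = []  # stand-in for frames encoded to disk
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            item = self.q.get()
            if item is None:  # sentinel: no more frames
                break
            self.written.append(item)

    def add(self, frame):
        self.q.put(frame)  # returns immediately; worker writes later

    def close(self):
        self.q.put(None)    # flush remaining writes, then stop
        self.worker.join()  # block until every queued frame is written
```

OpenCV VideoWriter objects need the analogous step: calling release() flushes buffered frames and closes the MP4 container.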

Execution Diagram

GitHub URL

Workflow Repository