Implementation:Datajuicer Data juicer VideoHandReconstructionHaworMapper
| Knowledge Sources | |
|---|---|
| Domains | Video Processing, 3D Reconstruction, Hand Pose Estimation |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Performs 3D hand reconstruction from video frames using the HaWoR model combined with MoGe-2 for scene geometry, providing detailed hand pose and mesh information for both left and right hands.
Description
VideoHandReconstructionHaworMapper is an advanced video analysis operator that extracts 3D hand pose and shape parameters from video data. It operates in a three-phase pipeline:
- FoV Estimation (MoGe-2) -- Uses the video_camera_calibration_static_moge_mapper sub-operator to estimate the per-frame horizontal camera field of view (FoV), then computes the focal length from the median FoV across all frames.
- Hand Pose and Translation Estimation (HaWoR) -- Performs the core hand reconstruction:
  - Detects hands using a YOLO-based detector with a configurable confidence threshold
  - Tracks hands across frames using YOLO's built-in tracking
  - Separates detections into left and right hand tracks based on handedness classification
  - Interpolates bounding boxes for missing frames using interpolate_bboxes
  - Runs the HaWoR model for 3D hand mesh reconstruction on each track chunk
  - Handles left-hand flipping by negating rotation axes for consistency
- Global Translation Recalculation (MANO Alignment) -- Refines the global translation by:
  - Running the MANO parametric hand model forward pass
  - Computing wrist joint positions
  - Adjusting translations based on wrist offsets
  - Flipping the x-axis for left-hand consistency
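The left-hand handling in phases 2 and 3 relies on a standard mirroring identity: reflecting a rotation across the plane x = 0 keeps the x component of its axis-angle vector and negates the y and z components. The sketch below checks that identity numerically with NumPy. It is not the operator's code; which components the operator actually negates is not stated above, so the y/z choice is an assumption that follows from an x-axis flip, and both helper names are hypothetical.

```python
import numpy as np


def axis_angle_to_matrix(aa):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    aa = np.asarray(aa, dtype=float)
    angle = np.linalg.norm(aa)
    if angle < 1e-12:
        return np.eye(3)
    k = aa / angle
    # Skew-symmetric cross-product matrix of the unit axis k.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def mirror_axis_angle_x(aa):
    """Mirror an axis-angle rotation across the plane x = 0 (assumed convention)."""
    return np.asarray(aa, dtype=float) * np.array([1.0, -1.0, -1.0])


# Reflection matrix for an x-axis flip.
M = np.diag([-1.0, 1.0, 1.0])
aa = np.array([0.3, -0.5, 0.8])

# Mirroring the rotation matrix (M R M) matches negating the y and z components.
R_mirrored = M @ axis_angle_to_matrix(aa) @ M
assert np.allclose(R_mirrored, axis_angle_to_matrix(mirror_axis_angle_x(aa)))
```

The same reflection applied to a translation vector only negates its x component, which is consistent with the x-axis flip in the last step of the list.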
The operator outputs per-frame hand reconstruction parameters for both hands:
- beta (shape parameters)
- hand_pose (joint rotations)
- global_orient (wrist orientation)
- transl (global translation)
During initialization, the operator clones the HaWoR repository, installs required packages (lap, pytorch_lightning, yacs, scikit-image, timm, omegaconf, smplx, chumpy), and downloads the detector model if not present.
The operator requires CUDA acceleration and the MANO hand model (MANO_RIGHT.pkl, which must be downloaded from the official MANO website).
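Because the operator installs packages and downloads assets at construction time, it can help to check the environment beforehand. The following is a minimal sketch using only the standard library; `missing_modules` and `preflight` are hypothetical helpers, not part of Data-Juicer, and the module list assumes the usual import names of the packages listed above (scikit-image is imported as `skimage`). It does not check CUDA availability.

```python
import importlib.util
from pathlib import Path

# Import names assumed for the packages listed above.
HAWOR_MODULES = [
    "lap", "pytorch_lightning", "yacs", "skimage",
    "timm", "omegaconf", "smplx", "chumpy",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(mano_right_path, modules=HAWOR_MODULES):
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    if not Path(mano_right_path).is_file():
        problems.append(f"MANO model not found: {mano_right_path}")
    missing = missing_modules(modules)
    if missing:
        problems.append("missing Python modules: " + ", ".join(missing))
    return problems


if __name__ == "__main__":
    for problem in preflight("/models/MANO_RIGHT.pkl"):
        print(problem)
```

An empty list from `preflight` means the MANO file exists and all listed modules are importable; the operator may still need network access to clone the HaWoR repository or fetch the detector.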
Usage
Use this operator for automated 3D hand pose and shape extraction from video data, supporting applications in gesture recognition, hand-object interaction analysis, sign language dataset creation, and hand motion capture.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/video_hand_reconstruction_hawor_mapper.py
- Lines: 1-474
Signature
class VideoHandReconstructionHaworMapper(Mapper):
    _accelerator = "cuda"

    def __init__(
        self,
        hawor_model_path: str = "hawor.ckpt",
        hawor_config_path: str = "model_config.yaml",
        hawor_detector_path: str = "detector.pt",
        moge_model_path: str = "Ruicheng/moge-2-vitl",
        mano_right_path: str = "path_to_mano_right_pkl",
        frame_num: PositiveInt = 3,
        duration: float = 0,
        thresh: float = 0.2,
        tag_field_name: str = MetaKeys.hand_reconstruction_hawor_tags,
        frame_dir: str = DATA_JUICER_ASSETS_CACHE,
        if_output_moge_info: bool = False,
        moge_output_info_dir: str = DATA_JUICER_ASSETS_CACHE,
        *args, **kwargs,
    ):
Import
from data_juicer.ops.mapper.video_hand_reconstruction_hawor_mapper import VideoHandReconstructionHaworMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hawor_model_path | str | No | Path to hawor.ckpt. Default: "hawor.ckpt" |
| hawor_config_path | str | No | Path to model_config.yaml. Default: "model_config.yaml" |
| hawor_detector_path | str | No | Path to detector.pt. Default: "detector.pt" |
| moge_model_path | str | No | Path to MoGe-2 model. Default: "Ruicheng/moge-2-vitl" |
| mano_right_path | str | Yes | Path to MANO_RIGHT.pkl (must be downloaded from https://mano.is.tue.mpg.de/). The default, "path_to_mano_right_pkl", is only a placeholder |
| frame_num | PositiveInt | No | Number of frames to extract. Default: 3 |
| duration | float | No | Duration per segment. 0 means entire video. Default: 0 |
| thresh | float | No | Confidence threshold for hand detection. Default: 0.2 |
| tag_field_name | str | No | Metadata field for storing results. Default: "hand_reconstruction_hawor_tags" |
| frame_dir | str | No | Directory for extracted frames. Default: DATA_JUICER_ASSETS_CACHE |
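| if_output_moge_info | bool | No | Whether to save MoGe-2 results. Default: False |
| moge_output_info_dir | str | No | Directory for saved MoGe-2 results. Default: DATA_JUICER_ASSETS_CACHE |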
Outputs
| Name | Type | Description |
|---|---|---|
| sample[Fields.meta][tag_field_name]["fov_x"] | float | Median horizontal field of view |
| sample[Fields.meta][tag_field_name]["left_frame_id_list"] | list[int] | Frame indices where left hand was detected |
| sample[Fields.meta][tag_field_name]["left_beta_list"] | list[np.ndarray] | Left hand shape parameters per frame |
| sample[Fields.meta][tag_field_name]["left_hand_pose_list"] | list[np.ndarray] | Left hand joint rotations per frame |
| sample[Fields.meta][tag_field_name]["left_global_orient_list"] | list[np.ndarray] | Left hand global orientation per frame |
| sample[Fields.meta][tag_field_name]["left_transl_list"] | list[np.ndarray] | Left hand global translation per frame |
| sample[Fields.meta][tag_field_name]["right_*"] | list | Same fields for the right hand |
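The fov_x output can be turned back into a focal length in pixels with the pinhole relation f = (W / 2) / tan(fov_x / 2), which is the usual way to derive a focal length from a horizontal FoV. The sketch below illustrates that relation and the median step described earlier. The unit of fov_x is not stated above, so degrees are assumed here; `focal_length_px` and the sample values are hypothetical.

```python
import math
import statistics


def focal_length_px(fov_x_deg, image_width_px):
    """Pinhole model: focal length in pixels from a horizontal FoV in degrees."""
    return (image_width_px / 2.0) / math.tan(math.radians(fov_x_deg) / 2.0)


# Hypothetical per-frame FoV estimates; the median damps single-frame outliers.
per_frame_fov = [59.0, 60.0, 61.0, 95.0]
fov_x = statistics.median(per_frame_fov)

# Focal length for a 1280-pixel-wide frame (width chosen for illustration).
focal = focal_length_px(fov_x, image_width_px=1280)
```

If the stored fov_x turns out to be in radians, drop the `math.radians` conversion.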
Usage Examples
from data_juicer.ops.mapper.video_hand_reconstruction_hawor_mapper import VideoHandReconstructionHaworMapper
from data_juicer.utils.constant import Fields

# Basic usage: point the operator at locally available model files
mapper = VideoHandReconstructionHaworMapper(
    hawor_model_path="/models/hawor.ckpt",
    hawor_config_path="/models/model_config.yaml",
    hawor_detector_path="/models/detector.pt",
    mano_right_path="/models/MANO_RIGHT.pkl",
    frame_num=10,
    thresh=0.3,
)

# Process a sample
sample = {
    "videos": ["/path/to/hand_video.mp4"],
    Fields.meta: {},
}
result = mapper.process_single(sample, rank=0)

# Access hand reconstruction data (stored under the default tag_field_name)
left_poses = result[Fields.meta]["hand_reconstruction_hawor_tags"]["left_hand_pose_list"]
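Since results are stored as parallel lists, it is often convenient to regroup them by frame. The helper below is a hypothetical sketch, not part of Data-Juicer. It assumes that each `*_list` is aligned index-by-index with the corresponding `*_frame_id_list`, and that the right-hand keys mirror the left-hand names, as the outputs table suggests.

```python
def index_by_frame(tags, side):
    """Map frame index -> dict of that frame's parameters for one hand.

    `tags` is the dict stored under tag_field_name; `side` is "left" or "right".
    """
    names = ("beta", "hand_pose", "global_orient", "transl")
    frame_ids = tags[f"{side}_frame_id_list"]
    columns = [tags[f"{side}_{name}_list"] for name in names]
    return {
        fid: dict(zip(names, values))
        for fid, *values in zip(frame_ids, *columns)
    }


# Tiny stand-in for real output, with strings in place of arrays.
example_tags = {
    "left_frame_id_list": [0, 2],
    "left_beta_list": ["b0", "b2"],
    "left_hand_pose_list": ["p0", "p2"],
    "left_global_orient_list": ["g0", "g2"],
    "left_transl_list": ["t0", "t2"],
}
left_by_frame = index_by_frame(example_tags, "left")
```

With real output, the call would be `index_by_frame(result[Fields.meta]["hand_reconstruction_hawor_tags"], "left")`; frames in which the hand was not detected are simply absent from the returned dict.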