High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Bo Peng*1,2     Xu Chen1     Yi Gu1,3     Hidenobu Matsuki1     Mingsong Dou1     Jingjing Shen1     Deying Kong1     Juyong Zhang2     Zhengyang Shen*1    
1Google XR
2University of Science and Technology of China (USTC)     3The Hong Kong University of Science and Technology (Guangzhou)
*Equal contribution

Our system robustly reconstructs 4D hand-object interactions from multi-view videos without requiring any pre-scanned object templates or physical markers, decomposing appearance to recover pristine 3D geometry.

Interactive 3D Visualizer

Interactive 3D reconstruction showing hand-object pose and mesh alignment. (Some sequences include markers for evaluation purposes only.)

Abstract

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is bottlenecked by the reliance on pre-scanned object templates and physical markers. Recent methods show promising results in reconstructing 4D hand-object interaction from videos, but they are highly sensitive to the initial estimates of hand and object poses. Estimating these poses from images is itself challenging, particularly under the severe occlusion inherent in hand-object interaction.

We propose a novel system for the robust and accurate reconstruction of hands and objects from multi-view videos without requiring any templates or markers. Our system consists of two main components: (1) HOST, a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable initialization; and (2) HOPG, a hand-object physics-aware Gaussian-based optimization framework that integrates tetrahedral constraints and collision refinement to produce physically plausible and visually accurate reconstructions.

Pipeline Overview

Our pipeline operates in two stages. First, the Hand Object Spatiotemporal Transformer (HOST) processes multi-view videos to robustly regress parametric hand/object poses, dense object point clouds, and segmentation masks, yielding a metric 3D initialization. Second, the Hand Object Physics-aware Gaussian (HOPG) module leverages this initialization to optimize a hybrid 2D Gaussian representation. By enforcing structural constraints and explicitly decoupling diffuse and specular appearance, HOPG effectively prevents illumination bake-in.
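
To make the hand-off between the two stages concrete, the following is a minimal sketch of a container for the stage-one outputs listed above. The class name `StageOneInit`, the field names, the array shapes, and the rigid 4x4 object transform are all assumptions made for illustration; the page does not specify the actual interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class StageOneInit:
    """Hypothetical container for the stage-one outputs named in the text."""

    hand_pose: np.ndarray      # (T, P) parametric hand pose per frame; P depends on the hand model
    object_pose: np.ndarray    # (T, 4, 4) object-to-world transform per frame (rigid object assumed)
    object_points: np.ndarray  # (N, 3) dense object point cloud in metric units
    masks: np.ndarray          # (T, V, H, W) boolean segmentation per frame and view

    def validate(self) -> int:
        """Check that the shapes are mutually consistent and return the frame count T."""
        t = self.hand_pose.shape[0]
        if self.object_pose.shape != (t, 4, 4):
            raise ValueError("object_pose must have shape (T, 4, 4)")
        if self.object_points.ndim != 2 or self.object_points.shape[1] != 3:
            raise ValueError("object_points must have shape (N, 3)")
        if self.masks.ndim != 4 or self.masks.shape[0] != t:
            raise ValueError("masks must have shape (T, V, H, W)")
        return t


# Example: 2 frames, 3 views, 8x8 masks, 100 object points, 48 hand parameters.
init = StageOneInit(
    hand_pose=np.zeros((2, 48)),
    object_pose=np.tile(np.eye(4), (2, 1, 1)),
    object_points=np.zeros((100, 3)),
    masks=np.zeros((2, 3, 8, 8), dtype=bool),
)
num_frames = init.validate()
```

Validating shapes at this boundary is one way a second-stage optimizer could fail early on a malformed initialization rather than partway through optimization.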

Robust Tracking with HOST

Appearance Decomposition & Rendering

HOPG achieves photorealistic rendering by decomposing appearance into diffuse and specular components. This explicit lighting decoupling prevents artifacts from baking into the geometry, yielding clean normal maps and high-fidelity 3D assets.
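
As an illustration of what a diffuse/specular split means, the sketch below shades a single surface point with a view-independent diffuse color plus a view-dependent highlight. The Blinn-Phong highlight, the function name `shade`, and the `spec_strength` and `shininess` parameters are assumptions chosen for simplicity; the page does not state which appearance model HOPG uses.

```python
import numpy as np


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def shade(diffuse_rgb, normal, view_dir, light_dir, spec_strength=0.5, shininess=32.0):
    """Return (diffuse, specular, combined) colors for one surface point.

    The diffuse term ignores the view direction; only the specular term
    (a Blinn-Phong highlight, assumed here) changes with the viewpoint.
    """
    n = normalize(normal)
    v = normalize(view_dir)
    l = normalize(light_dir)
    h = normalize(v + l)  # half vector between view and light directions
    spec = spec_strength * max(float(n @ h), 0.0) ** shininess
    specular = np.full(3, spec)
    combined = np.clip(diffuse_rgb + specular, 0.0, 1.0)
    return diffuse_rgb, specular, combined


albedo = np.array([0.2, 0.3, 0.4])
up = np.array([0.0, 0.0, 1.0])

# Viewed head-on, the highlight is at full strength.
d_front, s_front, c_front = shade(albedo, up, view_dir=up, light_dir=up)

# Viewed from the side, the highlight nearly vanishes but the diffuse color is unchanged.
d_side, s_side, c_side = shade(albedo, up, view_dir=np.array([1.0, 0.0, 0.0]), light_dir=up)
```

If a single view-independent color had to explain both observations, the highlight would have to be absorbed into the color or compensated by distorted geometry, which is the kind of bake-in that a separate specular term avoids.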

High-Fidelity Object Geometry

Detailed 3D meshes of complex, real-world objects extracted via TSDF fusion from our refined 2D Gaussian representation. Even without templates, our method recovers intricate geometric details and preserves structural integrity under complex occlusions.
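
For readers unfamiliar with TSDF fusion, here is a minimal sketch of the standard weighted running-average update. It assumes the fusion input is a set of depth maps and, to stay short, uses an orthographic camera looking along +z; the function name `integrate` and the grid layout are illustrative and not taken from the authors' implementation.

```python
import numpy as np


def integrate(tsdf, weight, voxel_z, depth, trunc):
    """Fuse one depth map into a truncated signed distance grid, in place.

    tsdf, weight: (H, W, D) float arrays; voxel_z: (D,) depth of each voxel slice;
    depth: (H, W) depth map from one view (orthographic along +z, an assumption).
    """
    sdf = depth[..., None] - voxel_z[None, None, :]  # positive in front of the surface
    valid = sdf > -trunc                             # skip voxels far behind the surface
    d = np.clip(sdf / trunc, -1.0, 1.0)              # truncate and normalize to [-1, 1]
    new_w = weight + valid
    fused = (tsdf * weight + d * valid) / np.maximum(new_w, 1)
    tsdf[:] = np.where(valid, fused, tsdf)
    weight[:] = new_w


# One pixel, five voxels along the ray, two slightly different depth observations.
voxel_z = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
tsdf = np.ones((1, 1, 5))
weight = np.zeros((1, 1, 5))
for observed in (0.5, 0.6):
    integrate(tsdf, weight, voxel_z, np.full((1, 1), observed), trunc=0.5)

# Locate the surface as the zero crossing of the fused values along the ray.
col = tsdf[0, 0]
i = int(np.where((col[:-1] > 0) & (col[1:] <= 0))[0][0])
surface_z = voxel_z[i] + (voxel_z[i + 1] - voxel_z[i]) * col[i] / (col[i] - col[i + 1])
```

In this toy case the fused surface lands between the two observations, which shows how averaging across views suppresses per-view depth noise. A mesh is then typically extracted from the zero level set with an algorithm such as marching cubes.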

BibTeX

If you find our work useful, please cite our paper:

@inproceedings{todo
}