VA2CONTACT: Visual-auditory Extrinsic Contact Estimation

Video Demonstration
Contact Estimation Demo
TL;DR: VA2CONTACT integrates visual and active audio sensing to accurately detect extrinsic contacts between a grasped object and the environment, even in occluded or ambiguous scenarios, with zero-shot sim-to-real transfer.
Abstract
Robust manipulation often hinges on a robot's ability to perceive extrinsic contacts—contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and transfer it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique, injecting real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks.
Method
Our method addresses the challenges of extrinsic contact estimation by combining the strengths of multiple sensing modalities. The system consists of three main components that work together to accurately detect and localize contacts between a grasped object and the environment.

1. Real-to-Sim Audio Hallucination: Since dense contact maps are difficult to obtain in the real world, we train entirely in simulation using ground-truth contact labels and bridge the gap to real-world acoustics with an audio hallucination technique, which sidesteps the difficulty of simulating audio. In the real world, we collect audio signals paired with labeled contact types using teleoperated wiping over various surfaces; in simulation, we generate contact maps and randomly sample real spectrograms matching each map's contact type, thereby pairing real-world acoustic signals with simulated labels (see the sampling sketch after this list).
2. VA2Contact: a multimodal network that predicts a dense contact probability map from depth, optical flow, and an audio spectrogram (see the model sketch after this list).
3. Real-World Contact-Aware Policy: We freeze the VA2Contact model and use its contact predictions, together with camera input, as observations for a Diffusion Policy (see the observation sketch after this list).
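
The sampling step of the audio hallucination pipeline can be summarized in a few lines. Below is a minimal sketch assuming a hypothetical directory layout and contact-type labels (`no_contact`, `edge_contact`, `surface_contact`); the paper's actual data format is not specified here.

```python
# Minimal sketch of real-to-sim audio hallucination (hypothetical names).
# Idea: simulation provides a ground-truth contact map plus a contact-type
# label; the audio channel is "hallucinated" by sampling a real spectrogram
# recorded for that contact type.
import random
from pathlib import Path
import numpy as np

# Bank of real spectrograms grouped by labeled contact type
# (e.g. collected via teleoperated wiping over various surfaces).
SPECTROGRAM_BANK = {
    "no_contact": list(Path("audio/no_contact").glob("*.npy")),
    "edge_contact": list(Path("audio/edge_contact").glob("*.npy")),
    "surface_contact": list(Path("audio/surface_contact").glob("*.npy")),
}

def hallucinate_audio(contact_type: str) -> np.ndarray:
    """Sample a real spectrogram matching the simulated contact type."""
    return np.load(random.choice(SPECTROGRAM_BANK[contact_type]))

def make_training_sample(sim_frame: dict) -> dict:
    """Attach hallucinated real audio to a simulated observation,
    keeping the simulator's ground-truth contact map as the label."""
    return {
        "depth": sim_frame["depth"],
        "flow": sim_frame["flow"],
        "audio": hallucinate_audio(sim_frame["contact_type"]),
        "label": sim_frame["contact_map"],  # dense ground-truth contact map
    }
```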
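For concreteness, here is a minimal PyTorch sketch of a VA2Contact-style network. The layer sizes, fusion scheme, and decoder are illustrative assumptions rather than the published architecture; the only fixed ingredients are the three inputs (depth, optical flow, audio spectrogram) and the dense probability-map output.

```python
# Illustrative VA2Contact-style network: visual features are fused with a
# global audio embedding, then decoded to a per-pixel contact probability.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
        nn.ReLU(inplace=True),
    )

class VA2ContactSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual branch: depth (1 ch) and optical flow (2 ch) stacked.
        self.visual = nn.Sequential(
            conv_block(3, 32), conv_block(32, 64), conv_block(64, 128)
        )
        # Audio branch: single-channel spectrogram to a global feature.
        self.audio = nn.Sequential(
            conv_block(1, 16), conv_block(16, 32),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
        )
        # Decoder: fused features back up to a dense contact logit map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, depth, flow, spectrogram):
        vis = self.visual(torch.cat([depth, flow], dim=1))  # (B,128,H/8,W/8)
        aud = self.audio(spectrogram)                       # (B,128)
        # Broadcast the global audio feature over the visual feature map.
        aud = aud[:, :, None, None].expand(-1, -1, *vis.shape[2:])
        logits = self.decoder(torch.cat([vis, aud], dim=1))
        return torch.sigmoid(logits)  # dense contact probability map (B,1,H,W)
```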
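Finally, a short sketch of how the frozen estimator's output could be packaged as a policy observation. The function name and channel layout are hypothetical, and the Diffusion Policy itself is treated as an off-the-shelf component that consumes the stacked observation.

```python
# Sketch of building a contact-aware observation: the frozen VA2Contact
# model's prediction is appended to the camera image as an extra channel.
import torch

@torch.no_grad()
def build_observation(contact_model, depth, flow, spectrogram, rgb):
    """Run frozen VA2Contact inference and stack its contact map onto the
    RGB image (assumes matching spatial resolution)."""
    contact_model.eval()
    contact_map = contact_model(depth, flow, spectrogram)  # (B,1,H,W)
    return torch.cat([rgb, contact_map], dim=1)            # (B,4,H,W)
```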
Key Features
Active Audio Sensing
Uses conduction speakers and contact microphones to detect contacts even in static scenes and under visual occlusion, providing rich information about contact dynamics that is often invisible to visual sensors.
Occlusion Handling
Maintains high performance even when contact points are visually occluded or in ambiguous near-contact scenarios, thanks to the complementary information provided by audio signals that can travel through solid objects.
Sim-to-Real Transfer
Our real-to-sim audio hallucination technique enables zero-shot transfer from simulation to real-world scenarios, eliminating the need for real-world training data with contact labels while maintaining robust performance.
Results
We evaluated VA2CONTACT on a variety of contact estimation tasks in real-world environments, comparing it against the vision-only Im2Contact baseline and several ablations of our method. Our experiments demonstrate that our multimodal approach significantly outperforms vision-only methods, especially in challenging scenarios.

Key findings include:
- Higher recall and F1 scores in general contact detection, indicating superior ability to detect true contacts with fewer false negatives
- Accurate contact detection even in heavily occluded scenarios where vision-only methods fail
- Better performance in ambiguous near-contact scenarios where visual information is insufficient
- Successful zero-shot transfer from simulation to real-world without additional training
- Improved performance in downstream manipulation tasks: a contact-aware diffusion policy for a wiping task achieved an 8/10 success rate, compared to 4/10 for a vision-only baseline
Contact
For questions about the project, please contact Xili Yi, Jayjun Lee, or Nima Fazeli.