Touch as Saliency

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Overview of the RGB-S tactile saliency paradigm
Overview of RGB-S. Instead of treating tactile readings as isolated vectors, RGB-S projects physical contacts into the RGB image plane and renders them as force-aware saliency maps. This gives the policy spatially aligned touch cues when task-relevant visual evidence is hidden.

Key Ideas

Image-Aligned Touch

Tactile nodes are localized with robot kinematics and projected through calibrated cameras into image coordinates.
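The projection step described above can be sketched with a standard pinhole camera model. This is a minimal illustration, not the authors' implementation: the function name `project_points`, the extrinsic convention (a 4x4 world-to-camera transform), and the example intrinsics are all assumptions, and the tactile node positions are taken as already computed by forward kinematics.

```python
import numpy as np

def project_points(points_world, T_cam_world, K):
    """Project N world-frame points to pixel coordinates (pinhole model).

    points_world: (N, 3) tactile node positions, e.g. from forward kinematics.
    T_cam_world:  (4, 4) transform mapping world coordinates to the camera frame.
    K:            (3, 3) camera intrinsic matrix.
    Returns (N, 2) pixel coordinates and an (N,) mask of points in front of the camera.
    """
    points_world = np.asarray(points_world, dtype=float)
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (T_cam_world @ homo.T).T[:, :3]      # points in the camera frame
    in_front = cam[:, 2] > 1e-6                # only these project meaningfully
    z = np.where(in_front, cam[:, 2], 1.0)     # avoid dividing by zero or negative depth
    uvw = (K @ cam.T).T
    pixels = uvw[:, :2] / z[:, None]
    return pixels, in_front

# Hypothetical intrinsics and an identity extrinsic, for illustration only.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
nodes = [[0.0, 0.0, 1.0], [0.1, -0.05, 0.5], [0.0, 0.0, -1.0]]
pixels, visible = project_points(nodes, np.eye(4), K)
# The first node lands on the principal point (320, 240); the third is behind the camera.
```

Points flagged as not in front of the camera would normally be skipped before rendering, as would points that fall outside the image bounds.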

Force-Aware Saliency

Contacts become dense Gaussian saliency maps whose intensity reflects tactile force magnitude and uncertainty.
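One way to picture the rendering step is the sketch below, which draws one Gaussian blob per projected contact and scales its peak by normalized force. The function name `render_saliency`, the fixed `sigma`, the `max_force` normalization, and the choice to merge overlapping blobs with a maximum are assumptions made for illustration; the page does not specify how uncertainty enters the map, so it is left out here.

```python
import numpy as np

def render_saliency(pixels, forces, height, width, sigma=8.0, max_force=1.0):
    """Render a single-channel saliency map from projected contacts.

    pixels: iterable of (u, v) image coordinates (u = column, v = row).
    forces: iterable of force magnitudes, one per contact.
    Each contact becomes a Gaussian whose peak is force / max_force, clipped to [0, 1].
    """
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float64)
    for (u, v), force in zip(pixels, forces):
        amplitude = float(np.clip(force / max_force, 0.0, 1.0))
        if amplitude == 0.0:
            continue  # no contact, nothing to draw
        blob = amplitude * np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        saliency = np.maximum(saliency, blob)  # assumption: overlapping blobs do not sum
    return saliency.astype(np.float32)

# Two contacts with different force readings on a 32 x 48 image.
saliency_map = render_saliency([(20, 10), (40, 25)], [0.5, 1.0], height=32, width=48)
```

A binary variant, like the one compared in the ablations, would correspond to setting every nonzero amplitude to 1.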

Lightweight Fusion

A pretrained RGB encoder is expanded from 3 to 4 input channels, with the saliency channel initialized to zero.
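The effect of zero initialization can be checked numerically. The sketch below uses NumPy instead of a deep learning framework, with random weights standing in for a pretrained first convolution; `expand_conv_weights` and `conv2d_valid` are hypothetical helpers written for this example. It shows that, before any fine-tuning, the 4-channel layer produces exactly the same output as the original 3-channel layer, whatever the saliency channel contains.

```python
import numpy as np

def expand_conv_weights(w_rgb):
    """Expand (out, 3, kh, kw) weights to (out, 4, kh, kw); the new channel is zero."""
    out_c, in_c, kh, kw = w_rgb.shape
    assert in_c == 3, "expected RGB weights"
    w = np.zeros((out_c, 4, kh, kw), dtype=w_rgb.dtype)
    w[:, :3] = w_rgb  # keep the pretrained RGB filters untouched
    return w

def conv2d_valid(x, w):
    """Naive stride-1, no-padding cross-correlation. x: (C, H, W), w: (O, C, kh, kw)."""
    o, c, kh, kw = w.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((o, h_out, w_out), dtype=np.result_type(x, w))
    for i in range(kh):
        for j in range(kw):
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], x[:, i:i + h_out, j:j + w_out])
    return out

rng = np.random.default_rng(0)
w_rgb = rng.normal(size=(8, 3, 3, 3))   # stand-in for pretrained filters
rgb = rng.normal(size=(3, 16, 16))
saliency = rng.random(size=(1, 16, 16))

w_rgbs = expand_conv_weights(w_rgb)
y_rgb = conv2d_valid(rgb, w_rgb)
y_rgbs = conv2d_valid(np.concatenate([rgb, saliency]), w_rgbs)
same_output = np.allclose(y_rgb, y_rgbs)  # identical at initialization
```

Because the saliency filters start at zero, training begins from the pretrained visual features and the contribution of touch grows only as gradients update the new channel.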

Abstract

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors.

RGB-S explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, tactile sensor locations are projected onto the RGB image plane and rendered as force-modulated Gaussian saliency maps. These 2D spatial anchors are integrated through a zero-initialized conditioning architecture that preserves pretrained visual features while allowing the policy to learn from contact cues.

We evaluate RGB-S on six dexterous manipulation tasks in simulation and the real world under severe visual occlusions. Real-world experiments show that RGB-S improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline.

RGB and Saliency Rollouts


RGB-S Architecture

The RGB-S architecture
RGB-S concatenates an RGB image with a tactile saliency map, encodes the 4-channel observation with a shared ResNet-18 visual backbone, compresses features with spatial softmax, and passes the resulting condition to downstream imitation learning policies such as Diffusion Policy, ACT, or MLP behavior cloning.
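The spatial softmax step mentioned above is commonly implemented as an expected-coordinate operation: each feature channel is turned into a probability map over image locations, and the output is the mean (x, y) position under that map. The sketch below follows that common formulation; the `temperature` parameter and the [-1, 1] coordinate range are assumptions, and the exact variant used in RGB-S may differ.

```python
import numpy as np

def spatial_softmax(features, temperature=1.0):
    """Reduce (C, H, W) feature maps to (C, 2) expected (x, y) keypoints in [-1, 1]."""
    features = np.asarray(features, dtype=float)
    c, h, w = features.shape
    flat = features.reshape(c, h * w) / temperature
    flat = flat - flat.max(axis=1, keepdims=True)   # subtract max for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(c, h, w)
    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    expected_x = (probs.sum(axis=1) * xs).sum(axis=1)  # marginal over rows, then mean x
    expected_y = (probs.sum(axis=2) * ys).sum(axis=1)  # marginal over columns, then mean y
    return np.stack([expected_x, expected_y], axis=1)

# One channel with a sharp peak in the top-right corner, one flat channel.
feature_maps = np.zeros((2, 5, 7))
feature_maps[0, 0, 6] = 50.0
keypoints = spatial_softmax(feature_maps)
# Channel 0 maps to roughly (1, -1); the flat channel maps to the image center (0, 0).
```

The flattened keypoints give a compact conditioning vector (two values per channel) that a downstream policy head can consume.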

Experimental Setup

Real-world experiments use an xArm6 robot equipped with a LEAP Hand. The hand provides tactile readings from 12 joint-mounted FSR sensors and 4 fingertip TwinTac sensors, yielding 44 projected tactile nodes. Two calibrated RealSense D435 cameras provide RGB observations.

Policies are trained from normal, unobstructed demonstrations and evaluated under both normal and software-masked occluded observations.

Hardware setup and tactile sensing layout for RGB-S experiments

Simulation and Real-World Tasks

Initialization workspaces for RGB-S simulation and real-world tasks
RGB-S is evaluated on six dexterous manipulation tasks: pick-and-place, cube-push, and rotate-cross in simulation; pick-and-place, open-drawer, and flip-box in the real world. Policies are trained with normal observations and evaluated under both normal and occluded settings.

Robustness Under Visual Occlusion

Real-world RGB-S occlusion evaluation with saliency rendered
During evaluation, a fixed black mask is applied to task-relevant image regions after the highlighted interaction stage. RGB-S keeps tactile evidence spatially available through the saliency channel, helping the policy continue reasoning when RGB observations are compromised.
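The software occlusion described above can be approximated as follows. This is a sketch under stated assumptions: the function name `apply_occlusion`, the rectangular mask shape, and the step-based trigger are illustrative, since the page only says that a fixed black mask covers task-relevant regions after the interaction stage.

```python
import numpy as np

def apply_occlusion(rgb, box, step, start_step):
    """Black out a fixed rectangle of an (H, W, 3) image once step >= start_step.

    box is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    The input image is left unmodified; a masked copy is returned.
    """
    out = rgb.copy()
    if step >= start_step:
        x0, y0, x1, y1 = box
        out[y0:y1, x0:x1] = 0
    return out

frame = np.full((4, 6, 3), 255, dtype=np.uint8)
before = apply_occlusion(frame, box=(1, 1, 3, 3), step=0, start_step=5)  # unchanged
after = apply_occlusion(frame, box=(1, 1, 3, 3), step=5, start_step=5)   # masked
```

Only the RGB channels are masked in this sketch; the saliency channel is rendered from kinematics and touch, so it is unaffected by what the camera can see.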

Real-World Rollouts with Occlusions


Results Snapshot

6 dexterous tasks across simulation and real-world evaluation
51.7% average real-world occluded success rate for RGB-S
+26.7 percentage-point gain over the strongest implicit tactile baseline under real-world occlusion

Method          Normal Avg.   Occluded Avg.
Vision-Only     56.7%         10.0%
Concat          55.0%         13.3%
Cross-Attn      30.0%         25.0%
Ours (RGB-S)    66.7%         51.7%
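The headline gain follows directly from the table. The short calculation below reproduces it and also reports how much each method loses when moving from normal to occluded observations. Treating Concat and Cross-Attn as the implicit tactile baselines is an assumption based on their names.

```python
# Real-world average success rates (%) copied from the table above.
normal = {"Vision-Only": 56.7, "Concat": 55.0, "Cross-Attn": 30.0, "Ours (RGB-S)": 66.7}
occluded = {"Vision-Only": 10.0, "Concat": 13.3, "Cross-Attn": 25.0, "Ours (RGB-S)": 51.7}

# Assumption: the implicit visuo-tactile baselines are the Concat and Cross-Attn rows.
tactile_baselines = ["Concat", "Cross-Attn"]
strongest = max(tactile_baselines, key=lambda method: occluded[method])
gain_pp = round(occluded["Ours (RGB-S)"] - occluded[strongest], 1)

# Drop from normal to occluded for each method, in percentage points.
drop_pp = {method: round(normal[method] - occluded[method], 1) for method in normal}

print(strongest, gain_pp)
print(drop_pp)
```

Cross-Attn loses the least under occlusion but starts from a much lower normal success rate, which is why RGB-S still leads it by a wide margin in the occluded setting.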

Attention Under Normal and Occluded Scenes

Sim

Six fusion methods are shown for each simulated scene, with normal observations on the left and occluded observations on the right.

Grad-CAM color scale for simulation attention maps
Grad-CAM intensity scale

Real

Four real-world fusion methods are shown for each scene, preserving the same single-image size as the simulation grid.

Grad-CAM color scale for real-world attention maps
Grad-CAM intensity scale

Design Ablations

Ablation on Tactile Saliency Rendering

Force-aware RGB-S reaches 78.5% normal and 39.7% occluded success on simulated pick-and-place, outperforming vision-only, RGB overlay, and binary saliency variants.

Video panels: RGB Overlay (RGB and saliency views), Binary RGB-S (RGB and saliency views), and Force-aware RGB-S.

Variant                     Normal (%)   Occluded (%)   Average (%)
Vision-only                 71.9         7.4            39.7
RGB Overlay                 65.3         33.1           49.2
Binary RGB-S                65.3         27.3           46.3
Ours (Force-aware RGB-S)    78.5         39.7           59.1

Ablation on Spatial Misalignment

RGB-S tolerates moderate projection noise, but performance degrades as the tactile saliency becomes severely misaligned with the image: in simulation, occluded success falls from 39.7% at 0 px to 9.9% at a 100 px offset.

Setting                   Condition   0 px    25 px   50 px   100 px
Sim (%)                   Normal      78.5    66.9    70.2    62.0
Sim (%)                   Occ.        39.7    32.2    24.0    9.9
Real (successes/trials)   Normal      9/20    8/20    5/20    5/20
Real (successes/trials)   Occ.        7/20    4/20    3/20    3/20
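A misalignment of a given pixel magnitude can be simulated by shifting the projected contact locations before rendering. The sketch below offsets each point by a fixed distance in a random direction. How the ablation actually applies the offset (per node, per episode, or in a fixed direction) is not stated on the page, so the function `shift_pixels` and its sampling scheme are assumptions.

```python
import numpy as np

def shift_pixels(pixels, magnitude_px, rng):
    """Offset each (u, v) point by magnitude_px pixels in an independent random direction."""
    pixels = np.asarray(pixels, dtype=float)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=len(pixels))
    offsets = magnitude_px * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return pixels + offsets

rng = np.random.default_rng(0)
projected = np.array([[100.0, 50.0], [10.0, 20.0]])
shifted = shift_pixels(projected, magnitude_px=25, rng=rng)
displacement = np.linalg.norm(shifted - projected, axis=1)  # 25 px for every point
```

Sweeping `magnitude_px` over 0, 25, 50, and 100 would mirror the column settings of the table above.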

Ablation on Fusion Architecture

Early zero-initialized RGB-S fusion achieves the strongest occluded success (39.7%), compared with 35.5% for late fusion and 22.3% for intermediate fusion.

Fusion architecture ablation details
Fusion architecture details from the appendix: feature concatenation, late fusion, and intermediate fusion alternatives used in the ablation study.
Architecture    Normal   Occ.
Late Fusion     73.6     35.5
Intermediate    73.6     22.3
Ours (Early)    78.5     39.7