RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

**Overview of RGB-S.** Instead of treating tactile readings as isolated vectors, RGB-S projects physical contacts into the RGB image plane and renders them as force-aware saliency maps. This gives the policy spatially aligned touch cues when task-relevant visual evidence is hidden.

Key Ideas

Image-Aligned Touch

Tactile nodes are localized with robot kinematics and projected through calibrated cameras into image coordinates.
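The projection step can be sketched with a standard pinhole camera model. This is a minimal illustration, not the authors' implementation: it assumes the tactile node positions have already been computed in the world frame by forward kinematics, and the names `project_points`, `T_cam_world` (a 4x4 world-to-camera extrinsic matrix), and `K` (a 3x3 intrinsic matrix) are hypothetical.

```python
import numpy as np

def project_points(points_world, T_cam_world, K):
    """Project N x 3 world-frame points to pixel coordinates (pinhole model).

    points_world: tactile node positions, assumed to come from forward kinematics.
    T_cam_world:  4x4 extrinsic matrix mapping world coordinates to the camera frame.
    K:            3x3 camera intrinsic matrix.
    Returns (uv, in_front): N x 2 pixel coordinates and a mask of points with depth > 0.
    """
    points_world = np.asarray(points_world, dtype=float)
    # Homogeneous coordinates, shape (N, 4)
    homog = np.hstack([points_world, np.ones((points_world.shape[0], 1))])
    # World frame -> camera frame
    points_cam = (T_cam_world @ homog.T).T[:, :3]
    depth = points_cam[:, 2]
    # Apply intrinsics, then divide by depth (perspective divide)
    uv_h = (K @ points_cam.T).T
    with np.errstate(divide="ignore", invalid="ignore"):
        uv = uv_h[:, :2] / uv_h[:, 2:3]
    # Points behind the camera should not be rendered
    in_front = depth > 0
    return uv, in_front
```

With an identity extrinsic and a principal point at (320, 240), a node one metre straight ahead of the camera lands at pixel (320, 240). Nodes flagged as not in front of the camera would typically be skipped before rendering.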

Force-Aware Saliency

Contacts become dense Gaussian saliency maps whose intensity reflects tactile force magnitude and uncertainty.
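One plausible way to render such a map is to place an isotropic Gaussian at each projected contact and scale its peak by the normalized force. The sketch below makes several assumptions that the text does not specify: a fixed `sigma` in pixels, a `max_force` used for normalization, and a per-pixel maximum to combine overlapping contacts. How uncertainty enters the rendering is not described here, so it is left out.

```python
import numpy as np

def render_saliency(uv, forces, height, width, sigma=8.0, max_force=1.0):
    """Render contacts as a single-channel saliency map with values in [0, 1].

    uv:     iterable of (u, v) pixel coordinates, u along width and v along height.
    forces: force magnitude per contact, in the same units as max_force.
    sigma and max_force are illustrative defaults, not values from the source.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    saliency = np.zeros((height, width), dtype=np.float64)
    for (u, v), force in zip(uv, forces):
        # Peak intensity is the force normalized and clipped to [0, 1]
        amplitude = float(np.clip(force / max_force, 0.0, 1.0))
        if amplitude <= 0.0:
            continue  # no contact, nothing to draw
        blob = amplitude * np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        # Per-pixel maximum keeps the map bounded when contacts overlap
        saliency = np.maximum(saliency, blob)
    return saliency.astype(np.float32)
```

Taking the maximum rather than the sum is a design choice in this sketch: it keeps the channel in a fixed range regardless of how many nodes are in contact.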

Lightweight Fusion

A pretrained RGB encoder is expanded from 3 to 4 input channels, with the saliency channel initialized to zero.
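The effect of zero initialization can be checked numerically: if the weights for the new fourth channel start at zero, the first layer's output on an RGB-S input equals its output on the RGB input alone. The sketch below uses plain NumPy rather than any particular deep learning framework; `expand_to_four_channels` and `conv2d` are hypothetical helper names, and the weight layout (out, in, kh, kw) is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def expand_to_four_channels(weight_rgb):
    """Append a zero-initialized 4th input channel to (out, 3, kh, kw) conv weights."""
    out_ch, in_ch, kh, kw = weight_rgb.shape
    if in_ch != 3:
        raise ValueError("expected weights for a 3-channel input")
    zeros = np.zeros((out_ch, 1, kh, kw), dtype=weight_rgb.dtype)
    return np.concatenate([weight_rgb, zeros], axis=1)

def conv2d(image, weight):
    """Stride-1, no-padding cross-correlation.

    image:  (C, H, W) array.
    weight: (O, C, kh, kw) array.
    Returns an (O, H - kh + 1, W - kw + 1) array.
    """
    kh, kw = weight.shape[2:]
    # windows has shape (C, H', W', kh, kw)
    windows = sliding_window_view(image, (kh, kw), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, weight)

rng = np.random.default_rng(0)
weight_rgb = rng.normal(size=(8, 3, 3, 3))    # stand-in for pretrained weights
rgb = rng.random((3, 16, 16))
saliency = rng.random((1, 16, 16))
rgbs = np.concatenate([rgb, saliency], axis=0)

out_rgb = conv2d(rgb, weight_rgb)
out_rgbs = conv2d(rgbs, expand_to_four_channels(weight_rgb))
print(np.allclose(out_rgb, out_rgbs))
```

Because the two outputs match at initialization, the pretrained features are untouched before training begins, and the saliency weights only move away from zero as gradients flow. In a framework such as PyTorch, the same idea would apply to the weight tensor of the encoder's first convolution.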

Abstract

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors.

RGB-S explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, tactile sensor locations are projected onto the RGB image plane and rendered as force-modulated Gaussian saliency maps. These 2D spatial anchors are integrated through a zero-initialized conditioning architecture that preserves pretrained visual features while allowing the policy to learn from contact cues.

We evaluate RGB-S on six dexterous manipulation tasks in simulation and the real world under severe visual occlusions. Real-world experiments show that RGB-S improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline.

RGB and Saliency Rollouts

RGB

Saliency


RGB-S Architecture

Experimental Setup

Real-world experiments use an xArm6 robot equipped with a LEAP Hand. The hand provides tactile readings from 12 joint-mounted FSR sensors and 4 fingertip TwinTac sensors, yielding 44 projected tactile nodes. Two calibrated RealSense D435 cameras provide RGB observations.

Policies are trained from normal, unobstructed demonstrations and evaluated under both normal and software-masked occluded observations.

Hardware setup and tactile sensing layout for RGB-S experiments

Simulation and Real-World Tasks

Robustness Under Visual Occlusion

Real-world RGB-S occlusion evaluation with saliency rendered. During evaluation, a fixed black mask is applied to task-relevant image regions after the highlighted interaction stage. RGB-S keeps tactile evidence spatially available through the saliency channel, helping the policy continue reasoning when RGB observations are compromised.
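A software mask of this kind can be sketched as zeroing a fixed rectangle in the RGB image while leaving the saliency channel untouched. The rectangular box format, the helper names `apply_black_mask` and `stack_rgbs`, and the scaling of RGB to [0, 1] are illustrative assumptions; the actual mask shape and timing are not specified here.

```python
import numpy as np

def apply_black_mask(image, box):
    """Return a copy of an (H, W, 3) image with a rectangle set to black.

    box = (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive (hypothetical format).
    """
    x0, y0, x1, y1 = box
    masked = image.copy()  # leave the original observation unmodified
    masked[y0:y1, x0:x1] = 0
    return masked

def stack_rgbs(image, saliency):
    """Stack a uint8 (H, W, 3) image and an (H, W) saliency map into (H, W, 4)."""
    rgb = image.astype(np.float32) / 255.0
    sal = saliency.astype(np.float32)[..., None]
    return np.concatenate([rgb, sal], axis=-1)
```

Masking only the RGB channels means the fourth channel still carries contact locations inside the occluded region, which is the property the evaluation relies on.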

Real-World Rollouts with Occlusions

RGB

Saliency


Results Snapshot

6 dexterous tasks across simulation and real-world evaluation

51.7% average real-world occluded success rate for RGB-S

+26.7 percentage-point gain over the strongest implicit tactile baseline under real-world occlusion

Method	Normal Avg.	Occluded Avg.
Vision-Only	56.7%	10.0%
Concat	55.0%	13.3%
Cross-Attn	30.0%	25.0%
Ours (RGB-S)	66.7%	51.7%
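The headline gain can be reproduced directly from the table above. The short script below copies the averages from the table and assumes that Concat and Cross-Attn are the implicit visuo-tactile baselines; that grouping is an inference from the reported 26.7-point figure rather than something stated in the table.

```python
# (normal %, occluded %) averages copied from the table above
results = {
    "Vision-Only": (56.7, 10.0),
    "Concat": (55.0, 13.3),
    "Cross-Attn": (30.0, 25.0),
    "RGB-S": (66.7, 51.7),
}

# Assumption: these two rows are the implicit visuo-tactile baselines
implicit_baselines = ["Concat", "Cross-Attn"]

# Strongest implicit baseline under occlusion
best = max(implicit_baselines, key=lambda name: results[name][1])

# Percentage-point gain of RGB-S over that baseline under occlusion
gain = round(results["RGB-S"][1] - results[best][1], 1)

# Percentage-point drop from normal to occluded for each method
drops = {name: round(normal - occluded, 1) for name, (normal, occluded) in results.items()}

print(best, gain)
print(drops)
```

The per-method drops also show why the occluded average matters more than the drop alone: Cross-Attn loses only 5.0 points under occlusion, but it starts from the lowest normal average in the table.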

Attention Under Normal and Occluded Scenes

Sim

Six fusion methods are shown for each simulated scene, with normal observations on the left and occluded observations on the right.

Grad-CAM intensity scale for the simulation attention maps.

Real

Four real-world fusion methods are shown for each scene, preserving the same single-image size as the simulation grid.
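For readers unfamiliar with Grad-CAM, the core computation behind such attention maps is small: average the gradients of a target score over each feature map to get one weight per channel, take the weighted sum of the feature maps, and keep the positive part. The sketch below assumes the activations and gradients of one convolutional layer have already been extracted; which layer and which policy output were used as the target for these figures is not stated here.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map.

    activations: (K, H, W) feature maps of one convolutional layer.
    gradients:   (K, H, W) gradients of a scalar target score w.r.t. those maps.
    Returns an (H, W) map scaled to [0, 1] (all zeros if nothing is positive).
    """
    # One weight per channel: global average of its gradient
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum over channels, giving an (H, W) map
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU keeps only regions that increase the target score
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam
```

The resulting low-resolution map is usually upsampled to the input image size and drawn with a color scale before being overlaid on the observation.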

Design Ablations

Ablation on Tactile Saliency Rendering

Force-aware RGB-S reaches 78.5% normal and 39.7% occluded success on simulated pick-and-place, outperforming vision-only, RGB overlay, and binary saliency variants.

RGB Overlay

RGB

Saliency

Binary RGB-S

RGB

Saliency

Force-aware RGB-S


Variant	Normal (%)	Occluded (%)	Average (%)
Vision-only	71.9	7.4	39.7
RGB Overlay	65.3	33.1	49.2
Binary RGB-S	65.3	27.3	46.3
Ours (Force-aware RGB-S)	78.5	39.7	59.1

Ablation on Spatial Misalignment

RGB-S is tolerant to moderate projection noise, but performance drops as tactile saliency becomes severely misaligned with the image.

Setting	Condition	0 px	25 px	50 px	100 px
Sim	Normal	78.5	66.9	70.2	62.0
Sim	Occ.	39.7	32.2	24.0	9.9
Real	Normal	9/20	8/20	5/20	5/20
Real	Occ.	7/20	4/20	3/20	3/20
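A misalignment study of this kind can be approximated by shifting each projected contact by a fixed number of pixels before rendering. The sketch below is only one possible noise model: it moves every node by exactly `magnitude_px` in an independent random direction. Whether the experiments above used per-node or shared offsets, or a different distribution, is not stated here, and `perturb_pixels` is a hypothetical helper.

```python
import numpy as np

def perturb_pixels(uv, magnitude_px, rng):
    """Shift each (u, v) pixel coordinate by magnitude_px in a random direction.

    uv:           N x 2 array of projected contact locations.
    magnitude_px: offset length in pixels (for example 0, 25, 50, or 100).
    rng:          a numpy.random.Generator, passed in for reproducibility.
    """
    uv = np.asarray(uv, dtype=float)
    # Independent random direction per node (an assumption of this sketch)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=uv.shape[0])
    offsets = magnitude_px * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return uv + offsets

rng = np.random.default_rng(0)
clean = np.array([[320.0, 240.0], [100.0, 50.0]])
noisy = perturb_pixels(clean, 25.0, rng)
print(np.linalg.norm(noisy - clean, axis=1))
```

Shifted coordinates can fall outside the image, in which case the corresponding contact contributes little or nothing to the rendered map. That is one reason large offsets would be expected to hurt most under occlusion, where the saliency channel is the main remaining cue.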

Ablation on Fusion Architecture

Early zero-initialized RGB-S fusion achieves stronger occluded performance than late or intermediate fusion alternatives.

Fusion architecture details from the appendix: feature concatenation, late fusion, and intermediate fusion alternatives used in the ablation study.

Architecture	Normal (%)	Occluded (%)
Late Fusion	73.6	35.5
Intermediate	73.6	22.3
Ours (Early)	78.5	39.7