Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Enhancing Tactile-based Reinforcement Learning for Robotic Control

Authors: Elle Miller, Trevor McInroe, David Abel, Oisin Mac Aodha, Sethu Vijayakumar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that sparse binary tactile signals are critical for dexterity, showing they can significantly improve performance beyond what is achievable with proprioceptive history alone. Our agents achieve superhuman dexterity in complex contact tasks (ball bouncing and Baoding ball rotation). Figures 3-5 depict the mean evaluation return across 5 seeds, with 1 standard deviation shaded. We evaluate our four proposed SSL objectives: Tactile Reconstruction (TR), Full Reconstruction (FR), Forward Dynamics (FD), and Tactile Forward Dynamics (TFD) (Section 3.2).
Researcher Affiliation | Academia | Elle Miller, Trevor McInroe, David Abel, Oisin Mac Aodha, Sethu Vijayakumar; University of Edinburgh
Pseudocode | No | The paper describes the proposed self-supervised objectives and their loss functions (e.g., L_TR, L_FR, L_FD, L_TD) mathematically and textually. It details the problem setting and the implementation of RL, but it does not include any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | We release our environments and baselines as an open-source benchmark, the Robot Tactile Olympiad (RoTO), to foster progress in tactile-based manipulation. Project page: https://elle-miller.github.io/tactile_rl. We will release all code and data upon acceptance, with scripts for generating each figure from the data and running each experiment. We will release our code as a benchmark for the community upon acceptance. The code will be well presented and documented.
Open Datasets | Yes | We introduce the Robot Tactile Olympiad (RoTO) benchmark, comprising three challenging Isaac Lab environments (Find, Bounce, and Baoding), with tuned baselines and integrated hyperparameter optimisation to standardise and inspire future research in tactile manipulation. We release our environments and baselines as an open-source benchmark, the Robot Tactile Olympiad (RoTO), to foster progress in tactile-based manipulation.
Dataset Splits | No | The paper uses Proximal Policy Optimisation (PPO), which is an on-policy reinforcement learning algorithm. Data is generated dynamically through agent-environment interactions (rollouts) rather than being split from a static, pre-defined dataset into training, testing, and validation sets. The text mentions "on-policy RL memory" and "auxiliary memory" for data storage during training but not fixed dataset splits.
Hardware Specification | Yes | Experiments were executed on a GPU cluster (8x NVIDIA RTX A4500s). The simulation environment (Isaac Lab) required 16GB VRAM, 32GB RAM, and 8 CPU cores.
Software Dependencies | No | We use a customised implementation of Proximal Policy Optimisation (PPO) [50] from SKRL [52] to incorporate observation stacking, self-supervision, and separated environments for continuous evaluation. We evaluate our method on three custom robotic manipulation tasks implemented within Isaac Lab [42]. The paper mentions SKRL and Isaac Lab as key software components but does not provide specific version numbers for these or any other libraries or programming languages.
Experiment Setup | Yes | We use a customised implementation of Proximal Policy Optimisation (PPO) [50] from SKRL [52] to incorporate observation stacking, self-supervision, and separated environments for continuous evaluation. We use 4096 parallelised environments for training and 100 for evaluation. Hyperparameters. To account for the fundamental changes introduced by the self-supervision on the state representation, we conducted an individual hyperparameter sweep for every environment and method combination, aligning with best practices for RL research [14]. Each sweep comprised 20 trials using the TPE sampler with 5 startup trials. The swept hyperparameters included PPO hyperparameters (learning rate lr, rollout length, number of minibatches, number of learning epochs, entropy loss scale c_ent), self-supervision hyperparameters (learning rate lr_aux, loss weight c_aux), and for forward dynamics objectives, the sequence length n. All sweep information is in Appendix G.
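
The Research Type entry states that Figures 3-5 show the mean evaluation return across 5 seeds with 1 standard deviation shaded. The following is a minimal sketch of how such a band could be computed from per-seed evaluation curves. It is not the authors' plotting code: the function name, the data layout, and the example numbers are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the table does not say which estimator the paper used.

```python
from statistics import mean, stdev


def aggregate_across_seeds(returns_per_seed):
    """Return (means, lower, upper) for a mean +/- 1 std band.

    returns_per_seed is a list with one evaluation curve per seed;
    every curve must have the same number of evaluation points.
    """
    if len(returns_per_seed) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    if len({len(curve) for curve in returns_per_seed}) != 1:
        raise ValueError("all seeds must have the same number of evaluation points")

    means, lower, upper = [], [], []
    # zip(*...) groups the values from all seeds at each evaluation point.
    for values in zip(*returns_per_seed):
        m = mean(values)
        s = stdev(values)  # sample std (n - 1); an assumption, see lead-in
        means.append(m)
        lower.append(m - s)
        upper.append(m + s)
    return means, lower, upper


# Hypothetical returns: 5 seeds, 2 evaluation points each (not from the paper).
example_returns = [
    [1.0, 2.0],
    [2.0, 3.0],
    [3.0, 4.0],
    [4.0, 5.0],
    [5.0, 6.0],
]
means, lower, upper = aggregate_across_seeds(example_returns)
```

The `lower` and `upper` lists give the edges of the shaded region around `means`. Switching to `statistics.pstdev` would give the population standard deviation instead, which produces a slightly narrower band for a small number of seeds.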