Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Real-World Reinforcement Learning of Active Perception Behaviors

Authors: Edward Hu, Jie Wang, Xingfang Yuan, Fiona Luo, Muyao Li, Gaspard Lambrechts, Oleh Rybkin, Dinesh Jayaraman

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In evaluations on 8 manipulation tasks on 3 robots spanning varying degrees of partial observability, AAWR synthesizes reliable active perception behaviors that outperform all prior approaches.
Researcher Affiliation | Academia | 1 University of Pennsylvania; 2 University of Liège; 3 UC Berkeley
Pseudocode | Yes | Algorithm 1: AAWR Offline-to-Online Training; Algorithm 2: Deployment
Open Source Code | No | We will release code for the algorithm and environments.
Open Datasets | Yes | We used the DROID robot setup [56], which consists of a 7-DoF Franka Emika Panda Robot Arm, a Robotiq 2F-85 parallel-jaw gripper, a wrist-mounted ZED Mini RGB-D camera and two side-mounted ZED 2 stereo cameras. The DROID set-up enables the usage of the generalist VLA policy π0 [57], specifically the FAST-DROID checkpoint.
Dataset Splits | No | We initially collect up to 250 demonstrations per task, but then we curate the dataset, dropping out trajectories with mislabeled object detections, noisy/faulty sensor readings, etc. After filtering, we end up with 152 demonstrations for Bookshelf-P, 109 for Bookshelf-D, 35 for Shelf-Cabinet, and 195 for Complex. ... We sample an equal number of transitions from both buffers to form a batch, following best practice from prior work [47, 51].
Hardware Specification | No | It used computing resources from the National Artificial Intelligence Research Resource Pilot (NAIRR 240077).
Software Dependencies | No | The wrist image is first fed into a frozen DINO-V2 [58] encoder (ViT-S14) ... To obtain object detection and segmentation of the target object, we used the DINO-X [59] API and the Grounded SAM [60] Model for Open-World Object Detection and segmentation. ... We query the Gemini-2.5-Flash [53] model with a task prompt template...
Experiment Setup | Yes | We train all models with a batch size of 256, learning rate of 0.0001, and the Adam optimizer. For online finetuning following [47], we use an update-to-data ratio of 1, performing gradient updates after every episode. For AWR and AAWR, we use an advantage temperature of 10.
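
To make the reported settings concrete, the following is a minimal Python sketch of how the hyperparameters in the Experiment Setup row and the equal-split batch sampling in the Dataset Splits row could fit together. It is not the authors' code, which has not been released. The names `TrainConfig`, `awr_weights`, and `sample_balanced_batch`, the `max_weight` clip, and the toy buffers are all hypothetical. The excerpt also does not say whether the advantage temperature divides or multiplies the advantage; the sketch assumes the convention of the original AWR formulation, exp(A / temperature).

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row.
    batch_size: int = 256
    learning_rate: float = 1e-4
    optimizer: str = "adam"
    utd_ratio: int = 1
    advantage_temperature: float = 10.0
    # Assumption: the clip value is illustrative and not taken from the paper.
    max_weight: float = 100.0


def awr_weights(advantages, temperature, max_weight):
    """Return clipped exponentiated-advantage weights.

    Assumption: the temperature divides the advantage, exp(A / temperature).
    The exponent is clipped before exponentiating to avoid overflow.
    """
    cap = math.log(max_weight)
    return [math.exp(min(a / temperature, cap)) for a in advantages]


def sample_balanced_batch(offline_buffer, online_buffer, batch_size, rng):
    """Draw an equal number of transitions from each buffer, with replacement."""
    half = batch_size // 2
    offline = rng.choices(offline_buffer, k=half)
    online = rng.choices(online_buffer, k=batch_size - half)
    return offline + online


cfg = TrainConfig()
rng = random.Random(0)

# Hypothetical stand-ins for stored transitions.
offline_buffer = [("offline", i) for i in range(1000)]
online_buffer = [("online", i) for i in range(50)]

batch = sample_balanced_batch(offline_buffer, online_buffer, cfg.batch_size, rng)
weights = awr_weights([-5.0, 0.0, 5.0], cfg.advantage_temperature, cfg.max_weight)
```

In an advantage-weighted regression update, weights like these would scale the per-transition log-likelihood loss of the policy, so that transitions with higher estimated advantage contribute more to the gradient.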