Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Active Reinforcement Learning Strategies for Offline Policy Improvement

Authors: Ambedkar Dukkipati, Ranga Shaarad Ayyagari, Bodhisattwa Dasgupta, Parag Dutta, Prabhas Reddy Onteru

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With extensive experimentation, we demonstrate that our proposed method reduces additional online interaction with the environment by up to 75% over competitive baselines across various continuous control environments such as Gym-MuJoCo locomotion environments as well as Maze2d, AntMaze, CARLA, and Isaac Sim Go1. To the best of our knowledge, this is the first work that addresses the active learning problem in the context of sequential decision-making and reinforcement learning.
Researcher Affiliation | Academia | Department of Computer Science and Automation, Indian Institute of Science EMAIL
Pseudocode | Yes | The procedure for active exploration is listed in Algorithm 1. Algorithm 1: Active Offline Reinforcement Learning
Open Source Code | Yes | Code: https://github.com/sml-iisc/ActiveRL
Open Datasets | Yes | D4RL (Nair et al. 2020) is a collection of offline datasets for training and testing offline RL algorithms. ... For the Isaac Sim experiments, we use the legged_gym API (Rudin et al. 2022) to simulate Unitree Go1 robots.
Dataset Splits | No | To validate the performance of our active algorithm in the context of limited data, we prune these datasets and create new smaller versions. We prune the medium and large Maze2d datasets by removing trajectories near the goal state. ... Additionally, we randomly subsample 30% of the trajectories in the AntMaze datasets and the random and medium versions of the locomotion datasets.
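The 30% trajectory subsampling described in the quote above could be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the function name, data layout (a dataset as a list of trajectories), and fixed seed are all assumptions.

```python
import random

def subsample_trajectories(trajectories, fraction=0.3, seed=0):
    """Randomly keep `fraction` of the trajectories from an offline dataset.

    A trajectory is represented here as a list of transitions; only whole
    trajectories are kept or dropped, never individual transitions.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible prune
    k = max(1, int(len(trajectories) * fraction))
    return rng.sample(trajectories, k)

# Toy example: 10 trajectories of 5 (state, action, reward) transitions each.
dataset = [[("s", "a", 0.0)] * 5 for _ in range(10)]
pruned = subsample_trajectories(dataset, fraction=0.3)
```

Subsampling at the trajectory level (rather than the transition level) preserves the temporal structure that offline RL algorithms rely on.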
Hardware Specification | No | Isaac Sim Go1: A GPU-based simulator to control a legged 4×3-DOF quadrupedal robot using proprioceptive measurements along with ego-centric height information of the terrain. ... The figures display the terrains for the Unitree Go1 robot experiments in the Nvidia Isaac Simulator.
Software Dependencies | No | For the offline phase of our algorithm, we use (i) TD3+BC (Fujimoto and Gu 2021), (ii) IQL (Kostrikov, Nair, and Levine 2022), (iii) CQL, and (iv) Behavior Cloning as the base offline RL algorithms. ... For legged locomotion, we use BPPO (Zhuang et al. 2023) as the offline policy learning algorithm in the active phase.
Experiment Setup | Yes | L = log σ(v − v⁺) + log(1 − σ(v − v⁻)) − λ‖v̂⁺ − v⁺‖², where σ(x) = 1/(1 + exp(−x)) is the sigmoid function and λ is a hyper-parameter. ... In the fine-tuning phase, the same training is continued on the newly collected data, with the α value being exponentially decayed to deal with the distribution shift (Beeson and Montana 2022). ... The degree of exploration is controlled by an ε-greedy variant of the exploration policy that explores using the aforementioned environment-aware uncertainty-based procedure with probability ε, and simply follows the policy π otherwise.
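The loss quoted above lost its minus signs in extraction; reading it as L = log σ(v − v⁺) + log(1 − σ(v − v⁻)) − λ‖v̂⁺ − v⁺‖² gives the sketch below. This is a hedged reconstruction: the sign placement, the scalar inputs, and the names `contrastive_value_loss`, `v_pos`, `v_neg`, and `v_pos_pred` are assumptions, not taken from the paper's code.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), as defined in the quoted setup
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_value_loss(v, v_pos, v_neg, v_pos_pred, lam=0.1):
    """Sketch of L = log sigma(v - v+) + log(1 - sigma(v - v-)) - lam * ||v_hat+ - v+||^2.

    v, v_pos, v_neg: scalar value estimates (anchor, positive, negative);
    v_pos_pred: a predicted positive value penalized toward v_pos with weight lam.
    """
    return (np.log(sigmoid(v - v_pos))
            + np.log(1.0 - sigmoid(v - v_neg))
            - lam * np.sum((v_pos_pred - v_pos) ** 2))
```

For example, with v = 0, v⁺ = 1, v⁻ = −1, and a perfect prediction v̂⁺ = v⁺, both log terms contribute log σ(−1) ≈ −1.3133 and the penalty vanishes.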