Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Active Reinforcement Learning Strategies for Offline Policy Improvement

Authors: Ambedkar Dukkipati, Ranga Shaarad Ayyagari, Bodhisattwa Dasgupta, Parag Dutta, Prabhas Reddy Onteru

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With extensive experimentation, we demonstrate that our proposed method reduces additional online interaction with the environment by up to 75% over competitive baselines across various continuous control environments such as Gym-MuJoCo locomotion environments as well as Maze2d, AntMaze, CARLA, and Isaac Sim Go1. To the best of our knowledge, this is the first work that addresses the active learning problem in the context of sequential decision-making and reinforcement learning.
Researcher Affiliation | Academia | Department of Computer Science and Automation, Indian Institute of Science EMAIL
Pseudocode | Yes | The procedure for active exploration is listed in Algorithm 1. Algorithm 1: Active Offline Reinforcement Learning
Open Source Code | Yes | Code: https://github.com/sml-iisc/ActiveRL
Open Datasets | Yes | D4RL (Nair et al. 2020) is a collection of offline datasets for training and testing offline RL algorithms. ... For the Isaac Sim experiments, we use the legged_gym API (Rudin et al. 2022) to simulate Unitree Go1 robots.
Dataset Splits | No | To validate the performance of our active algorithm in the context of limited data, we prune these datasets and create new smaller versions. We prune the medium and large Maze2d datasets by removing trajectories near the goal state. ... Additionally, we randomly subsample 30% of the trajectories in the AntMaze datasets and the random and medium versions of the locomotion datasets.
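The 30% trajectory subsampling described in the quote above could be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the function name, data layout (a dataset as a list of trajectories), and fixed seed are all assumptions.

```python
import random

def subsample_trajectories(trajectories, fraction=0.3, seed=0):
    """Randomly keep `fraction` of the trajectories from an offline dataset.

    A trajectory is represented here as a list of transitions; only whole
    trajectories are kept or dropped, never individual transitions.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible prune
    k = max(1, int(len(trajectories) * fraction))
    return rng.sample(trajectories, k)

# Toy example: 10 trajectories of 5 (state, action, reward) transitions each.
dataset = [[("s", "a", 0.0)] * 5 for _ in range(10)]
pruned = subsample_trajectories(dataset, fraction=0.3)
```

Subsampling at the trajectory level (rather than the transition level) preserves the temporal structure that offline RL algorithms rely on.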
Hardware Specification | No | Isaac Sim Go1: A GPU-based simulator to control a legged 4×3-DOF quadrupedal robot using proprioceptive measurements along with ego-centric height information of the terrain. ... The figures display the terrains for the Unitree Go1 robot experiments in the Nvidia Isaac Simulator.
Software Dependencies | No | For the offline phase of our algorithm, we use (i) TD3+BC (Fujimoto and Gu 2021), (ii) IQL (Kostrikov, Nair, and Levine 2022), (iii) CQL, and (iv) Behavior Cloning as the base offline RL algorithms. ... For legged locomotion, we use BPPO (Zhuang et al. 2023) as the offline policy learning algorithm in the active phase.
Experiment Setup | Yes | L = log σ(v − v⁺) + log(1 − σ(v − v⁻)) − λ‖v̂⁺ − v⁺‖², where σ(x) = 1/(1 + exp(−x)) is the sigmoid function and λ is a hyper-parameter. ... In the fine-tuning phase, the same training is continued on the newly collected data, with the α value being exponentially decayed to deal with the distribution shift (Beeson and Montana 2022). ... The degree of exploration is controlled by an ε-greedy variant of the exploration policy that explores using the aforementioned environment-aware uncertainty-based procedure with probability ε, and simply follows the policy π otherwise.
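The loss quoted above lost its minus signs in extraction; reading it as L = log σ(v − v⁺) + log(1 − σ(v − v⁻)) − λ‖v̂⁺ − v⁺‖² gives the sketch below. This is a hedged reconstruction: the sign placement, the scalar inputs, and the names `contrastive_value_loss`, `v_pos`, `v_neg`, and `v_pos_pred` are assumptions, not taken from the paper's code.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), as defined in the quoted setup
    return 1.0 / (1.0 + np.exp(-x))

def contrastive_value_loss(v, v_pos, v_neg, v_pos_pred, lam=0.1):
    """Sketch of L = log sigma(v - v+) + log(1 - sigma(v - v-)) - lam * ||v_hat+ - v+||^2.

    v, v_pos, v_neg: scalar value estimates (anchor, positive, negative);
    v_pos_pred: a predicted positive value penalized toward v_pos with weight lam.
    """
    return (np.log(sigmoid(v - v_pos))
            + np.log(1.0 - sigmoid(v - v_neg))
            - lam * np.sum((v_pos_pred - v_pos) ** 2))
```

For example, with v = 0, v⁺ = 1, v⁻ = −1, and a perfect prediction v̂⁺ = v⁺, both log terms contribute log σ(−1) ≈ −1.3133 and the penalty vanishes.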