Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration
Authors: Desik Rengarajan, Gargi Vaidya, Akshay Sarvesh, Dileep Kalathil, Srinivas Shakkottai
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse reward settings. We perform an exhaustive performance analysis of LOGO, first through simulations under four standard (sparsified) environments on the widely used MuJoCo platform (Todorov et al., 2012). Next, we conduct simulations on the Gazebo simulator (Koenig & Howard, 2004) using LOGO for way-point tracking by a robot in environments with and without obstacles, with the only reward being attainment of way points. Finally, we transfer the trained models to a real-world TurtleBot for experiments. |
| Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, Texas A&M University {desik,gargivaidya,sarvesh,dileep.kalathil,sshakkot}@tamu.edu |
| Pseudocode | Yes | Algorithm 1 LOGO Algorithm |
| Open Source Code | Yes | Code base and a video of the TurtleBot experiments: https://github.com/DesikRengarajan/LOGO |
| Open Datasets | Yes | We perform an exhaustive performance analysis of LOGO, first through simulations under four standard (sparsified) environments on the widely used MuJoCo platform (Todorov et al., 2012). Note that for all algorithms, we evaluate the final performance in the corresponding dense reward environment provided by OpenAI Gym, which provides a standardized way of comparing their relative merits. |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test dataset splits with percentages, sample counts, or specific methodologies for partitioning data. For reinforcement learning environments like MuJoCo and Gazebo, data is typically generated through interaction rather than being drawn from a fixed, pre-split dataset in the traditional supervised learning sense. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the simulations or training the models. It mentions the TurtleBot 3 as the robotic platform for real-world experiments, but this refers to the subject of the experiment, not the computational hardware used for training. |
| Software Dependencies | No | The paper states 'We implement all the algorithms in this paper using PyTorch (Paszke et al., 2019)'. While PyTorch is mentioned, specific version numbers for PyTorch or other critical software dependencies are not provided, which is necessary for full reproducibility. |
| Experiment Setup | Yes | We use a learning rate of 3 × 10⁻⁴, a discount factor γ = 0.99, and TRPO parameter δ = 0.01. We decay the influence of the behavior policy by decaying δk. We start with δ0, and we do not decay δk for the first Kδ iterations. For k > Kδ, we geometrically decay δk as δk ← αδk, whenever the average return in the current iteration is greater than the average return in the past 10 iterations. The rest of the hyperparameters for MuJoCo simulations, Gazebo simulation, and real-world experiments are given in Table 1. Table 1: Hyperparameters (includes δ0, α, Kδ, batch size for different environments). A sketch of this decay schedule is given below the table. |
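
The Experiment Setup row quotes the paper's schedule for the guidance parameter δk: hold δ0 fixed for the first Kδ iterations, then multiply by α whenever the current iteration's average return beats the average of the previous 10 iterations. The following is a minimal sketch of that schedule in Python; the class and parameter names (`DeltaSchedule`, `delta0`, `alpha`, `k_delta`) and the numeric values in the usage example are illustrative assumptions, not taken from the authors' code base or from Table 1.

```python
from collections import deque


class DeltaSchedule:
    """Illustrative geometric decay of the guidance parameter delta_k.

    delta_k stays at delta0 for the first k_delta iterations; afterwards it is
    multiplied by alpha whenever the current iteration's average return exceeds
    the average return over the previous `window` iterations (10 in the paper).
    """

    def __init__(self, delta0: float, alpha: float, k_delta: int, window: int = 10):
        # Assumption: alpha < 1 so that the behavior-policy influence shrinks.
        assert 0.0 < alpha < 1.0
        self.delta = delta0
        self.alpha = alpha
        self.k_delta = k_delta
        self.returns = deque(maxlen=window)  # average returns of past iterations

    def update(self, k: int, avg_return: float) -> float:
        """Return delta_k for iteration k, given that iteration's average return."""
        if (
            k > self.k_delta
            and self.returns
            and avg_return > sum(self.returns) / len(self.returns)
        ):
            self.delta *= self.alpha
        self.returns.append(avg_return)
        return self.delta


# Hypothetical usage inside a training loop; delta0, alpha, and k_delta would
# come from Table 1 of the paper for the chosen environment.
schedule = DeltaSchedule(delta0=0.05, alpha=0.95, k_delta=100)
for k in range(1, 1001):
    avg_return = 0.0  # placeholder: average return collected at iteration k
    delta_k = schedule.update(k, avg_return)
    # delta_k would then bound the KL divergence to the behavior policy in the
    # guidance step of Algorithm 1 (LOGO).
```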