Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration
Authors: Desik Rengarajan, Gargi Vaidya, Akshay Sarvesh, Dileep Kalathil, Srinivas Shakkottai
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse reward settings. We perform an exhaustive performance analysis of LOGO, first through simulations under four standard (sparsified) environments on the widely used MuJoCo platform (Todorov et al., 2012). Next, we conduct simulations on the Gazebo simulator (Koenig & Howard, 2004) using LOGO for way-point tracking by a robot in environments with and without obstacles, with the only reward being attainment of way points. Finally, we transfer the trained models to a real-world TurtleBot for experiments. |
| Researcher Affiliation | Academia | Department of Electrical and Computer Engineering, Texas A&M University {desik,gargivaidya,sarvesh,dileep.kalathil,sshakkot}@tamu.edu |
| Pseudocode | Yes | Algorithm 1 LOGO Algorithm |
| Open Source Code | Yes | Code base and a video of the TurtleBot experiments: https://github.com/DesikRengarajan/LOGO |
| Open Datasets | Yes | We perform an exhaustive performance analysis of LOGO, first through simulations under four standard (sparsified) environments on the widely used MuJoCo platform (Todorov et al., 2012). Note that for all algorithms, we evaluate the final performance in the corresponding dense reward environment provided by OpenAI Gym, which provides a standardized way of comparing their relative merits. |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, and test dataset splits with percentages, sample counts, or specific methodologies for partitioning data. For reinforcement learning environments like MuJoCo and Gazebo, data is typically generated through interaction rather than being drawn from a fixed, pre-split dataset in the traditional supervised learning sense. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the simulations or training the models. It mentions the TurtleBot 3 as the robotic platform for real-world experiments, but this refers to the subject of the experiment, not the computational hardware used for training. |
| Software Dependencies | No | The paper states 'We implement all the algorithms in this paper using PyTorch (Paszke et al., 2019)'. While PyTorch is mentioned, specific version numbers for PyTorch or other critical software dependencies are not provided, which is necessary for full reproducibility. |
| Experiment Setup | Yes | We use a learning rate of 3 × 10⁻⁴, a discount factor γ = 0.99, and TRPO parameter δ = 0.01. We decay the influence of the behavior policy by decaying δk. We start with δ0, and we do not decay δk for the first Kδ iterations. For k > Kδ, we geometrically decay δk as δk ← αδk, whenever the average return in the current iteration is greater than the average return in the past 10 iterations. The rest of the hyperparameters for MuJoCo simulations, Gazebo simulation, and real-world experiments are given in Table 1. Table 1: Hyperparameters (includes δ0, α, Kδ, batch size for different environments). A sketch of this decay schedule is given below the table. |
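
The Experiment Setup row quotes the paper's schedule for the guidance parameter δk: hold δ0 fixed for the first Kδ iterations, then multiply by α whenever the current iteration's average return beats the average of the previous 10 iterations. The following is a minimal sketch of that schedule in Python; the class and parameter names (`DeltaSchedule`, `delta0`, `alpha`, `k_delta`) and the numeric values in the usage example are illustrative assumptions, not taken from the authors' code base or from Table 1.

```python
from collections import deque


class DeltaSchedule:
    """Illustrative geometric decay of the guidance parameter delta_k.

    delta_k stays at delta0 for the first k_delta iterations; afterwards it is
    multiplied by alpha whenever the current iteration's average return exceeds
    the average return over the previous `window` iterations (10 in the paper).
    """

    def __init__(self, delta0: float, alpha: float, k_delta: int, window: int = 10):
        # Assumption: alpha < 1 so that the behavior-policy influence shrinks.
        assert 0.0 < alpha < 1.0
        self.delta = delta0
        self.alpha = alpha
        self.k_delta = k_delta
        self.returns = deque(maxlen=window)  # average returns of past iterations

    def update(self, k: int, avg_return: float) -> float:
        """Return delta_k for iteration k, given that iteration's average return."""
        if (
            k > self.k_delta
            and self.returns
            and avg_return > sum(self.returns) / len(self.returns)
        ):
            self.delta *= self.alpha
        self.returns.append(avg_return)
        return self.delta


# Hypothetical usage inside a training loop; delta0, alpha, and k_delta would
# come from Table 1 of the paper for the chosen environment.
schedule = DeltaSchedule(delta0=0.05, alpha=0.95, k_delta=100)
for k in range(1, 1001):
    avg_return = 0.0  # placeholder: average return collected at iteration k
    delta_k = schedule.update(k, avg_return)
    # delta_k would then bound the KL divergence to the behavior policy in the
    # guidance step of Algorithm 1 (LOGO).
```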