Learning Guidance Rewards with Trajectory-space Smoothing

Authors: Tanmay Gangwani, Yuan Zhou, Jian Peng

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section evaluates our approach on various single-agent and multi-agent RL tasks to quantify the benefits of using the guidance rewards in place of the environmental rewards, when the latter are sparse or delayed. [...] Figure 3 plots the learning curves for all the algorithms with episodic rewards.
Researcher Affiliation | Academia | Tanmay Gangwani, Dept. of Computer Science, UIUC, gangwan2@illinois.edu; Yuan Zhou, Dept. of ISE, UIUC, yuanz@illinois.edu; Jian Peng, Dept. of Computer Science, UIUC, jianpeng@illinois.edu
Pseudocode | Yes | Algorithm 1: Tabular Q-learning with IRCR; Algorithm 2: Soft Actor-Critic with IRCR (a minimal sketch of Algorithm 1 appears after this table)
Open Source Code | Yes | Code for this paper is available at https://github.com/tgangwani/GuidanceRewards
Open Datasets | Yes | We benchmark high-dimensional, continuous-control locomotion tasks based on the MuJoCo physics simulator, provided in OpenAI Gym [3] [...] We adopt the Rover Domain from Rahmattalabi et al. [20]. (An episodic-reward wrapper sketch appears after this table.)
Dataset Splits | No | No explicit train/validation/test dataset splits are provided for the MuJoCo or Rover Domain environments; these are simulation environments where data is generated dynamically rather than drawn from pre-defined static datasets with fixed splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper mentions the MuJoCo physics simulator, OpenAI Gym, and various RL algorithms (Q-learning, Actor-Critic, TD3, SAC, Distributional RL), but does not specify software names with version numbers for reproducibility.
Experiment Setup | Yes | Please see Appendix A.2 for hyperparameters and other details. [...] We experiment with different values for N, K, and the coupling factor (Appendix A.2).
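To make the Pseudocode row concrete, below is a minimal Python sketch of what Algorithm 1 (tabular Q-learning with IRCR) might look like, assuming the guidance reward for every transition in a trajectory is that trajectory's episodic return normalized by the minimum and maximum returns observed so far. The toy environment interface (reset, step, sample_action, num_actions) and all hyperparameter values are illustrative placeholders, not taken from the paper or its Appendix A.2.

```python
import random
from collections import defaultdict

def tabular_q_learning_with_ircr(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning where the (sparse/delayed) environment reward is
    replaced by an IRCR-style guidance reward: every transition in a
    trajectory is credited with the trajectory's episodic return, normalized
    by the minimum and maximum returns observed so far."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    ret_min, ret_max = float("inf"), float("-inf")

    for _ in range(episodes):
        s, done = env.reset(), False
        trajectory, episodic_return = [], 0.0

        while not done:
            # epsilon-greedy action selection over the tabular Q-values
            if random.random() < eps:
                a = env.sample_action()
            else:
                a = max(range(env.num_actions), key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            trajectory.append((s, a, s_next, done))
            episodic_return += r                 # env reward only forms the return
            s = s_next

        # normalize the episodic return against the running min/max returns
        ret_min = min(ret_min, episodic_return)
        ret_max = max(ret_max, episodic_return)
        guidance = (episodic_return - ret_min) / (ret_max - ret_min + 1e-8)

        # Q-updates are driven by the dense guidance reward, not the env reward
        for (s, a, s_next, terminal) in trajectory:
            bootstrap = 0.0 if terminal else gamma * max(
                Q[(s_next, x)] for x in range(env.num_actions))
            Q[(s, a)] += alpha * (guidance + bootstrap - Q[(s, a)])
    return Q
```

Algorithm 2 in the paper applies the same reward substitution within Soft Actor-Critic; the sketch above covers only the tabular case.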
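Regarding the Open Datasets row: the delayed/episodic-reward tasks are not separate datasets but standard Gym environments whose per-step reward is withheld until termination. Below is a minimal wrapper sketch under that assumption, using the classic 4-tuple Gym step API of that era; the paper's exact environment code is in the linked repository.

```python
import gym

class EpisodicRewardWrapper(gym.Wrapper):
    """Withhold per-step rewards and pay out the accumulated episodic
    return on the final step, producing the delayed-reward setting."""

    def __init__(self, env):
        super().__init__(env)
        self._accumulated = 0.0

    def reset(self, **kwargs):
        self._accumulated = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self._accumulated += reward
        # zero reward on intermediate steps, the full return at episode end
        return obs, (self._accumulated if done else 0.0), done, info

# Usage (hypothetical environment id): env = EpisodicRewardWrapper(gym.make("HalfCheetah-v2"))
```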