Heuristic-Guided Reinforcement Learning

Authors: Ching-An Cheng, Andrey Kolobov, Adith Swaminathan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33], where soft actor-critic (SAC) [35] and proximal policy optimization (PPO) [36] were used as the base RL algorithms, respectively. The goal is to study whether HuRL can accelerate learning by shortening the horizon with heuristics.
Researcher Affiliation | Industry | Ching-An Cheng, Microsoft Research, Redmond, WA (chinganc@microsoft.com); Andrey Kolobov, Microsoft Research, Redmond, WA (akolobov@microsoft.com); Adith Swaminathan, Microsoft Research, Redmond, WA (adswamin@microsoft.com)
Pseudocode | Yes | Algorithm 1: Heuristic-Guided Reinforcement Learning (HuRL) (see the illustrative sketch after this table)
Open Source Code | Yes | Code to replicate all experiments is available at https://github.com/microsoft/HuRL.
Open Datasets | Yes | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33].
Dataset Splits | No | The paper describes hyperparameter tuning and experimental runs but does not provide explicit training, validation, or test splits in terms of percentages or counts, as would be typical for static supervised-learning datasets; in this RL setting, data is generated through environment interaction.
Hardware Specification | Yes | All experiments were run on an internal GPU cluster of Microsoft Research, with Nvidia RTX 2080 Ti GPUs and Intel(R) Xeon(R) Gold 6248R CPUs.
Software Dependencies | No | The paper mentions using Garage [37], Ray [57], MuJoCo [32], and Procgen [33] but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | The hyperparameters used in the algorithms above were selected as follows. First, the learning rates and the discount factor of the base RL algorithm, SAC, were tuned for each environment. ... For the HuRL algorithms, the mixing coefficient was scheduled as λ_n = λ_0 + (1 − λ_0) c_ω tanh(ω(n − 1)), for n = 1, ..., N, where λ_0 ∈ [0, 1], ω > 0 controls the increasing rate, and c_ω is a normalization constant such that λ_N = 1 and λ_n ∈ [0, 1]. (A schedule sketch follows this table.)
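For context on the reshaping idea behind Algorithm 1 and the "shortening the horizon" claim quoted above, here is a minimal sketch in Python. It assumes the HuRL-style formulation in which the environment reward is blended with a heuristic value of the next state and the base RL algorithm (e.g., SAC or PPO) is trained with a shorter "guidance" discount; the function names and signatures are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of heuristic-guided reshaping (not the released HuRL code).

def reshaped_reward(reward, next_state, heuristic, gamma, lam):
    """Blend the environment reward with the heuristic value h(s') of the next state.

    lam = 1 recovers the original reward; smaller lam leans more on the heuristic.
    (Assumed form: r + (1 - lam) * gamma * h(s').)
    """
    return reward + (1.0 - lam) * gamma * heuristic(next_state)


def guidance_discount(gamma, lam):
    """The base RL algorithm is trained with the shortened discount lam * gamma,
    which is what "shortening the horizon with heuristics" refers to."""
    return lam * gamma
```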
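The mixing-coefficient schedule quoted in the Experiment Setup row is straightforward to reproduce numerically. The sketch below assumes the normalization constant c_ω is chosen so that the final coefficient λ_N equals 1; the function name and example values are hypothetical and not taken from the released code.

```python
import numpy as np

def mixing_schedule(num_iters, lambda0, omega):
    """lambda_n = lambda0 + (1 - lambda0) * c_omega * tanh(omega * (n - 1)), n = 1..N.

    c_omega is set so that lambda_N = 1 (an assumption based on the paper's statement
    that the coefficients lie in [0, 1] and the schedule is normalized).
    """
    n = np.arange(1, num_iters + 1)
    c_omega = 1.0 / np.tanh(omega * (num_iters - 1))
    return lambda0 + (1.0 - lambda0) * c_omega * np.tanh(omega * (n - 1))

# Example: ramp from lambda_0 = 0.5 toward 1 over 100 training iterations.
print(mixing_schedule(100, lambda0=0.5, omega=0.05))
```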