Heuristic-Guided Reinforcement Learning

Authors: Ching-An Cheng, Andrey Kolobov, Adith Swaminathan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33], where soft actor-critic (SAC) [35] and proximal policy optimization (PPO) [36] were used as the base RL algorithms, respectively. The goal is to study whether HuRL can accelerate learning by shortening the horizon with heuristics.
Researcher Affiliation | Industry | Ching-An Cheng, Microsoft Research, Redmond, WA (chinganc@microsoft.com); Andrey Kolobov, Microsoft Research, Redmond, WA (akolobov@microsoft.com); Adith Swaminathan, Microsoft Research, Redmond, WA (adswamin@microsoft.com)
Pseudocode | Yes | Algorithm 1: Heuristic-Guided Reinforcement Learning (HuRL) (see the illustrative sketch after this table)
Open Source Code | Yes | Code to replicate all experiments is available at https://github.com/microsoft/HuRL.
Open Datasets | Yes | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33].
Dataset Splits | No | The paper describes hyperparameter tuning and experimental runs but does not provide explicit training, validation, or test splits in terms of percentages or counts, as would be typical for static supervised-learning datasets; in this RL setting, data is generated through environment interaction.
Hardware Specification | Yes | All experiments were run on an internal GPU cluster of Microsoft Research, with Nvidia RTX 2080 Ti GPUs and Intel(R) Xeon(R) Gold 6248R CPUs.
Software Dependencies | No | The paper mentions using Garage [37], Ray [57], MuJoCo [32], and Procgen [33] but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | The hyperparameters used in the algorithms above were selected as follows. First, the learning rates and the discount factor of the base RL algorithm, SAC, were tuned for each environment. ... For the HuRL algorithms, the mixing coefficient was scheduled as λ_n = λ_0 + (1 − λ_0) c_ω tanh(ω(n − 1)), for n = 1, ..., N, where λ_0 ∈ [0, 1], ω > 0 controls the increasing rate, and c_ω is a normalization constant such that λ_N = 1 and λ_n ∈ [0, 1]. (A schedule sketch follows this table.)
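For context on the reshaping idea behind Algorithm 1 and the "shortening the horizon" claim quoted above, here is a minimal sketch in Python. It assumes the HuRL-style formulation in which the environment reward is blended with a heuristic value of the next state and the base RL algorithm (e.g., SAC or PPO) is trained with a shorter "guidance" discount; the function names and signatures are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of heuristic-guided reshaping (not the released HuRL code).

def reshaped_reward(reward, next_state, heuristic, gamma, lam):
    """Blend the environment reward with the heuristic value h(s') of the next state.

    lam = 1 recovers the original reward; smaller lam leans more on the heuristic.
    (Assumed form: r + (1 - lam) * gamma * h(s').)
    """
    return reward + (1.0 - lam) * gamma * heuristic(next_state)


def guidance_discount(gamma, lam):
    """The base RL algorithm is trained with the shortened discount lam * gamma,
    which is what "shortening the horizon with heuristics" refers to."""
    return lam * gamma
```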
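The mixing-coefficient schedule quoted in the Experiment Setup row is straightforward to reproduce numerically. The sketch below assumes the normalization constant c_ω is chosen so that the final coefficient λ_N equals 1; the function name and example values are hypothetical and not taken from the released code.

```python
import numpy as np

def mixing_schedule(num_iters, lambda0, omega):
    """lambda_n = lambda0 + (1 - lambda0) * c_omega * tanh(omega * (n - 1)), n = 1..N.

    c_omega is set so that lambda_N = 1 (an assumption based on the paper's statement
    that the coefficients lie in [0, 1] and the schedule is normalized).
    """
    n = np.arange(1, num_iters + 1)
    c_omega = 1.0 / np.tanh(omega * (num_iters - 1))
    return lambda0 + (1.0 - lambda0) * c_omega * np.tanh(omega * (n - 1))

# Example: ramp from lambda_0 = 0.5 toward 1 over 100 training iterations.
print(mixing_schedule(100, lambda0=0.5, omega=0.05))
```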