Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Heuristic-Guided Reinforcement Learning
Authors: Ching-An Cheng, Andrey Kolobov, Adith Swaminathan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33], where soft actor-critic (SAC) [35] and proximal policy optimization (PPO) [36] were used as the base RL algorithms, respectively. The goal is to study whether HuRL can accelerate learning by shortening the horizon with heuristics. |
| Researcher Affiliation | Industry | Ching-An Cheng, Microsoft Research, Redmond, WA, EMAIL; Andrey Kolobov, Microsoft Research, Redmond, WA, EMAIL; Adith Swaminathan, Microsoft Research, Redmond, WA, EMAIL |
| Pseudocode | Yes | Algorithm 1 Heuristic-Guided Reinforcement Learning (HuRL) |
| Open Source Code | Yes | Code to replicate all experiments is available at https://github.com/microsoft/HuRL. |
| Open Datasets | Yes | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33] |
| Dataset Splits | No | The paper describes hyperparameter tuning and experimental runs but does not provide explicit training, validation, or test dataset splits in terms of percentages or counts, as is typical for static supervised learning datasets. Data is generated through environment interactions in RL. |
| Hardware Specification | Yes | All experiments were run on an internal GPU cluster of Microsoft Research, with Nvidia RTX 2080 Ti GPUs and Intel(R) Xeon(R) Gold 6248R CPUs. |
| Software Dependencies | No | The paper mentions using Garage [37], Ray [57], Mu Jo Co [32], and Procgen [33] but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The hyperparameters used in the algorithms above were selected as follows. First, the learning rates and the discount factor of the base RL algorithm, SAC, were tuned for each environment. ... For the HuRL algorithms, the mixing coefficient was scheduled as λ_n = λ_0 + (1 − λ_0) c_ω tanh(ω(n − 1)), for n = 1, ..., N, where λ_0 ∈ [0, 1], ω > 0 controls the increasing rate, and c_ω is a normalization constant such that λ_N = 1 and λ_n ∈ [0, 1]. |
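The λ-schedule quoted in the Experiment Setup row can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the closed form for the normalization constant c_ω = 1/tanh(ω(N − 1)) is inferred from the stated condition λ_N = 1, and the parameter values in the example are arbitrary rather than the paper's tuned settings.

```python
import math

def lambda_schedule(n: int, lambda0: float, omega: float, N: int) -> float:
    """Mixing coefficient λ_n = λ_0 + (1 − λ_0) · c_ω · tanh(ω(n − 1)).

    c_ω is chosen (assumed here as 1/tanh(ω(N − 1))) so that λ_N = 1,
    which also keeps λ_n within [λ_0, 1] for n = 1, ..., N.
    """
    c_omega = 1.0 / math.tanh(omega * (N - 1))  # normalization constant
    return lambda0 + (1.0 - lambda0) * c_omega * math.tanh(omega * (n - 1))

# The schedule rises monotonically from λ_1 = λ_0 to λ_N = 1.
schedule = [lambda_schedule(n, lambda0=0.5, omega=0.1, N=20) for n in range(1, 21)]
```

With ω controlling how quickly the tanh saturates, a small ω gives a nearly linear ramp over the N iterations, while a large ω pushes λ_n toward 1 early in training.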