Heuristic-Guided Reinforcement Learning
Authors: Ching-An Cheng, Andrey Kolobov, Adith Swaminathan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33], where soft actor critic (SAC) [35] and proximal policy optimization (PPO) [36] were used as the base RL algorithms, respectively. The goal is to study whether HuRL can accelerate learning by shortening the horizon with heuristics. |
| Researcher Affiliation | Industry | Ching-An Cheng, Microsoft Research, Redmond, WA, chinganc@microsoft.com; Andrey Kolobov, Microsoft Research, Redmond, WA, akolobov@microsoft.com; Adith Swaminathan, Microsoft Research, Redmond, WA, adswamin@microsoft.com |
| Pseudocode | Yes | Algorithm 1 Heuristic-Guided Reinforcement Learning (HuRL) |
| Open Source Code | Yes | Code to replicate all experiments is available at https://github.com/microsoft/HuRL. |
| Open Datasets | Yes | We validate our framework HuRL experimentally in MuJoCo [32] robotics control problems and Procgen games [33] |
| Dataset Splits | No | The paper describes hyperparameter tuning and experimental runs but does not provide explicit training, validation, or test dataset splits in terms of percentages or counts, as is typical for static supervised learning datasets. Data is generated through environment interactions in RL. |
| Hardware Specification | Yes | All experiments were run on an internal GPU cluster of Microsoft Research, with Nvidia RTX 2080 Ti GPUs and Intel(R) Xeon(R) Gold 6248R CPUs. |
| Software Dependencies | No | The paper mentions using Garage [37], Ray [57], MuJoCo [32], and Procgen [33] but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | The hyperparameters used in the algorithms above were selected as follows. First, the learning rates and the discount factor of the base RL algorithm, SAC, were tuned for each environment. ... For the HuRL algorithms, the mixing coefficient was scheduled as λ_n = λ_0 + (1 − λ_0) c_ω tanh(ω(n − 1)), for n = 1, ..., N, where λ_0 ∈ [0, 1], ω > 0 controls the increasing rate, and c_ω is a normalization constant such that λ_N = 1 and λ_n ∈ [0, 1]. |
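
As a concrete illustration of the mixing-coefficient schedule quoted in the Experiment Setup row, here is a minimal Python sketch. The function name and the hyperparameter values (λ_0, ω, N) are illustrative assumptions, not the paper's tuned settings; only the formula itself comes from the excerpt above.

```python
import numpy as np

def mixing_coefficient(n, N, lambda0=0.5, omega=0.01):
    """lambda_n = lambda0 + (1 - lambda0) * c_omega * tanh(omega * (n - 1)).

    c_omega is chosen so that lambda_N = 1, so the schedule rises
    monotonically from lambda0 at n = 1 toward 1 at n = N.
    Hyperparameter values here are illustrative, not the paper's tuned ones.
    """
    c_omega = 1.0 / np.tanh(omega * (N - 1))
    return lambda0 + (1.0 - lambda0) * c_omega * np.tanh(omega * (n - 1))

# Example: over N = 500 iterations the coefficient grows from 0.5 to 1.
N = 500
schedule = [mixing_coefficient(n, N) for n in range(1, N + 1)]
assert abs(schedule[0] - 0.5) < 1e-9 and abs(schedule[-1] - 1.0) < 1e-9
```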
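The paper's Algorithm 1 is only named in the Pseudocode row, so the sketch below is a rough paraphrase of the HuRL outer loop rather than the authors' pseudocode. It assumes the reshaped reward r + (1 − λ)γ·h(s′) and the shortened discount λγ commonly described for HuRL; the `base_rl` and `heuristic` interfaces and all hyperparameters are hypothetical placeholders.

```python
def hurl_training_loop(env, base_rl, heuristic, gamma, N, lambda0=0.5, omega=0.01):
    """Rough sketch of a HuRL-style outer loop (Algorithm 1 paraphrased).

    At iteration n, the heuristic h reshapes the reward to
    r_tilde = r + (1 - lambda_n) * gamma * h(s_next) and the effective
    discount becomes lambda_n * gamma, so the base RL algorithm solves a
    shorter-horizon surrogate problem. Annealing lambda_n toward 1
    recovers the original MDP at the end of training.
    `base_rl` and `heuristic` are hypothetical interfaces for illustration.
    """
    policy = base_rl.initial_policy()
    for n in range(1, N + 1):
        lam = mixing_coefficient(n, N, lambda0, omega)  # schedule from the sketch above

        def reshaped_reward(s, a, r, s_next):
            # Blend the environment reward with the heuristic's look-ahead value.
            return r + (1.0 - lam) * gamma * heuristic(s_next)

        # One update of the base algorithm (e.g. SAC or PPO) on the
        # reshaped, shorter-horizon surrogate problem.
        policy = base_rl.update(env, policy,
                                reward_fn=reshaped_reward,
                                discount=lam * gamma)
    return policy
```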