History Compression via Language Models in Reinforcement Learning

Authors: Fabian Paischer, Thomas Adler, Vihang Patil, Angela Bitto-Nemling, Markus Holzleitner, Sebastian Lehner, Hamid Eghbal-Zadeh, Sepp Hochreiter

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on partially observable environments to exploit the compressed history abstraction of the PLT. Furthermore, we train on procedurally generated environments, which enhance diversity and force the agent to learn generalizable skills by sampling level configurations from a predefined distribution.
Researcher Affiliation | Academia | 1 LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; 2 ELLIS Unit Linz; 3 Institute of Advanced Research in Artificial Intelligence (IARAI), Vienna, Austria. Correspondence to: Fabian Paischer <paischer@ml.jku.at>.
Pseudocode | Yes | Algorithm 1: HELM
Open Source Code | Yes | Our code is available at https://github.com/ml-jku/helm.
Open Datasets | Yes | On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm. As toy tasks we generate a Random Maze environment (Zuo, 2018), select a memory-dependent environment from Minigrid (Key Corridor; Chevalier-Boisvert et al., 2018), and evaluate our agents on complex environments from the Procgen suite (Cobbe et al., 2020).
Dataset Splits | Yes | We evaluate for sample efficiency by measuring the performance at the end of training and test for statistical significance via a one-sided Wilcoxon rank-sum test (Wilcoxon, 1945) at a confidence level of α = 0.05. The performance is evaluated by measuring the interquartile mean (IQM) and 95 % bootstrapped confidence intervals (CIs), as proposed in Agarwal et al. (2021). We train for 2M interaction steps and evaluate for sample efficiency at the end of training. The budget of interaction steps for Procgen is limited to 25M steps, and we train on the entire level distribution across 10 seeds to evaluate for sample efficiency.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions 'We use the stable-baselines3 (Raffin et al., 2019) package for the implementation of our algorithm.' and 'We use the huggingface implementation (Wolf et al., 2020) of TrXL.' However, specific version numbers for these packages are not provided.
Experiment Setup | Yes | We perform a parameter search over the learning rate in {3e-4, 1e-5, 5e-5, 1e-4}, the entropy coefficient in {0.05, 0.01, 0.005, 0.001}, the rollout length in {64, 128, 256}, and the softmax scaling factor β in {0.5, 1, 10, 50, 100} for HELM on the Minigrid environments and the Random Maze environment. For Procgen environments, we reduce the possible values for the entropy coefficient to {0.01, 0.005, 0.001} and for β to {1, 10, 100}. The best hyperparameters for HELM on all Minigrid environments are shown in Table 3, while those for the Procgen environments are shown in Table 4.
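The evaluation protocol quoted under Dataset Splits (interquartile mean with 95 % bootstrapped CIs, plus a one-sided Wilcoxon rank-sum test at α = 0.05) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code; the per-seed score arrays are synthetic placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

def iqm(scores):
    """Interquartile mean: the mean after trimming the top and bottom 25 %."""
    return stats.trim_mean(scores, proportiontocut=0.25)

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the IQM (95 % by default)."""
    if rng is None:
        rng = np.random.default_rng(0)
    boots = [iqm(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Synthetic final returns for 10 seeds of two agents (placeholder data).
rng = np.random.default_rng(42)
helm = rng.normal(8.0, 1.0, size=10)
baseline = rng.normal(7.0, 1.0, size=10)

lo, hi = bootstrap_ci(helm)
# One-sided Wilcoxon rank-sum test: is HELM's score distribution shifted upward?
stat, p = stats.ranksums(helm, baseline, alternative="greater")
print(f"IQM={iqm(helm):.2f}, 95% CI=({lo:.2f}, {hi:.2f}), p={p:.4f}")
```

The IQM discards the most extreme runs on both ends, which is why Agarwal et al. (2021) recommend it over the plain mean for small numbers of RL seeds.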
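The hyperparameter search described under Experiment Setup is an exhaustive grid over four value sets; a sketch of such a grid is below. The search space is taken from the quoted text (Minigrid / Random Maze setting), but `train_and_eval` is a hypothetical stand-in with a dummy objective, where a real run would train and evaluate a full HELM agent.

```python
from itertools import product

# Search space quoted from the paper (Minigrid / Random Maze setting).
grid = {
    "learning_rate": [3e-4, 1e-5, 5e-5, 1e-4],
    "entropy_coef": [0.05, 0.01, 0.005, 0.001],
    "rollout_length": [64, 128, 256],
    "beta": [0.5, 1, 10, 50, 100],
}

def train_and_eval(config):
    """Hypothetical stand-in: a real version would train an agent and
    return its final IQM return. Here we score with a dummy objective."""
    return -abs(config["learning_rate"] - 1e-4)

# Materialize every combination and pick the best-scoring configuration.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(configs, key=train_and_eval)
print(f"{len(configs)} configurations, best lr={best['learning_rate']}")
```

With the quoted value sets, the grid contains 4 × 4 × 3 × 5 = 240 configurations per environment, which is why the Procgen search space is reduced to fewer values per hyperparameter.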