History Compression via Language Models in Reinforcement Learning

Authors: Fabian Paischer, Thomas Adler, Vihang Patil, Angela Bitto-Nemling, Markus Holzleitner, Sebastian Lehner, Hamid Eghbal-Zadeh, Sepp Hochreiter

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on partially observable environments to exploit the compressed history abstraction of the PLT. Furthermore, we train on procedurally generated environments, which enhance diversity and force the agent to learn generalizable skills by sampling level configurations from a predefined distribution.
Researcher Affiliation | Academia | 1 LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; 2 ELLIS Unit Linz; 3 Institute of Advanced Research in Artificial Intelligence (IARAI), Vienna, Austria. Correspondence to: Fabian Paischer <paischer@ml.jku.at>.
Pseudocode | Yes | Algorithm 1: HELM
Open Source Code | Yes | Our code is available at https://github.com/ml-jku/helm.
Open Datasets | Yes | On Minigrid and Procgen environments HELM achieves new state-of-the-art results. Our code is available at https://github.com/ml-jku/helm. As toy tasks we generate a Random Maze environment (Zuo, 2018), select a memory-dependent environment from Minigrid (Key Corridor; Chevalier-Boisvert et al., 2018), and evaluate our agents on complex environments from the Procgen suite (Cobbe et al., 2020).
Dataset Splits | Yes | We evaluate for sample efficiency by measuring the performance at the end of training and test for statistical significance via a one-sided Wilcoxon rank-sum test (Wilcoxon, 1945) at a confidence level of α = 0.05. The performance is evaluated by measuring the interquartile mean (IQM) and 95 % bootstrapped confidence intervals (CIs), as proposed in Agarwal et al. (2021). We train for 2M interaction steps and evaluate for sample efficiency at the end of training. The budget of interaction steps for Procgen is limited to 25M steps, and we train on the entire level distribution across 10 seeds to evaluate for sample efficiency.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions 'We use the stable-baselines3 (Raffin et al., 2019) package for the implementation of our algorithm.' and 'We use the huggingface implementation (Wolf et al., 2020) of TrXL.' However, specific version numbers for these packages are not provided.
Experiment Setup | Yes | We perform a parameter search over the learning rate in {3e-4, 1e-5, 5e-5, 1e-4}, the entropy coefficient in {0.05, 0.01, 0.005, 0.001}, the rollout length in {64, 128, 256}, and the softmax scaling factor β in {0.5, 1, 10, 50, 100} for HELM on the Minigrid environments and the Random Maze environment. For Procgen environments, we reduce the possible values for the entropy coefficient to {0.01, 0.005, 0.001} and for β to {1, 10, 100}. The best hyperparameters for HELM on all Minigrid environments are shown in Table 3, while those for the Procgen environments are shown in Table 4.
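The evaluation protocol quoted under Dataset Splits (interquartile mean with 95 % bootstrapped CIs, plus a one-sided Wilcoxon rank-sum test at α = 0.05) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code; the per-seed score arrays are synthetic placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

def iqm(scores):
    """Interquartile mean: the mean after trimming the top and bottom 25 %."""
    return stats.trim_mean(scores, proportiontocut=0.25)

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the IQM (95 % by default)."""
    if rng is None:
        rng = np.random.default_rng(0)
    boots = [iqm(rng.choice(scores, size=len(scores), replace=True))
             for _ in range(n_boot)]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Synthetic final returns for 10 seeds of two agents (placeholder data).
rng = np.random.default_rng(42)
helm = rng.normal(8.0, 1.0, size=10)
baseline = rng.normal(7.0, 1.0, size=10)

lo, hi = bootstrap_ci(helm)
# One-sided Wilcoxon rank-sum test: is HELM's score distribution shifted upward?
stat, p = stats.ranksums(helm, baseline, alternative="greater")
print(f"IQM={iqm(helm):.2f}, 95% CI=({lo:.2f}, {hi:.2f}), p={p:.4f}")
```

The IQM discards the most extreme runs on both ends, which is why Agarwal et al. (2021) recommend it over the plain mean for small numbers of RL seeds.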
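The hyperparameter search described under Experiment Setup is an exhaustive grid over four value sets; a sketch of such a grid is below. The search space is taken from the quoted text (Minigrid / Random Maze setting), but `train_and_eval` is a hypothetical stand-in with a dummy objective, where a real run would train and evaluate a full HELM agent.

```python
from itertools import product

# Search space quoted from the paper (Minigrid / Random Maze setting).
grid = {
    "learning_rate": [3e-4, 1e-5, 5e-5, 1e-4],
    "entropy_coef": [0.05, 0.01, 0.005, 0.001],
    "rollout_length": [64, 128, 256],
    "beta": [0.5, 1, 10, 50, 100],
}

def train_and_eval(config):
    """Hypothetical stand-in: a real version would train an agent and
    return its final IQM return. Here we score with a dummy objective."""
    return -abs(config["learning_rate"] - 1e-4)

# Materialize every combination and pick the best-scoring configuration.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(configs, key=train_and_eval)
print(f"{len(configs)} configurations, best lr={best['learning_rate']}")
```

With the quoted value sets, the grid contains 4 × 4 × 3 × 5 = 240 configurations per environment, which is why the Procgen search space is reduced to fewer values per hyperparameter.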