Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning

Authors: Max Weltevrede, Moritz Zanger, Matthijs Spaan, Wendelin Boehmer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
Researcher Affiliation	Academia	Max Weltevrede, Delft University of Technology, Delft, The Netherlands; Moritz A. Zanger, Delft University of Technology, Delft, The Netherlands; Matthijs T. J. Spaan, Delft University of Technology, Delft, The Netherlands; Wendelin Böhmer, Delft University of Technology, Delft, The Netherlands
Pseudocode	No	The paper describes algorithms and methods in prose but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The code for all the experiments in the main text can be found at https://github.com/MWeltevrede/distillation-after-training.
Open Datasets	Yes	In Section 5.2, we demonstrate that the insights also apply to the more complex Minigrid Four Rooms environment (Chevalier-Boisvert et al., 2023) that breaks most of the assumptions required for the proof in Section 4. For experimental details, see Appendix C.
Dataset Splits	Yes	We used separate validation and testing sets consisting of unseen contexts of size 40 and 200 respectively. The validation set was used for algorithm development and hyperparameter tuning, and the test set was only used as a final evaluation (and is reported in Table 3).
Hardware Specification	Yes	The Four Room experiments were executed on a computer with an NVIDIA RTX 3070 GPU, Intel Core i7 12700 CPU and 32 GB of memory.
Software Dependencies	No	The paper mentions using Stable-Baselines3 but does not provide a specific version number. It does not list other software dependencies with versions.
Experiment Setup	Yes	Table 4: Hyper-parameters used for the Reacher with rotational symmetry CMDP experiments: Epochs 500; Batch size 6; Learning rate 1 × 10⁻⁴. Table 6: Hyperparameters used for policy distillation in the Four Rooms environment: Teacher (Epochs 100; Batch size 64; Learning rate 1 × 10⁻⁴); Explore-Go (Epochs 50; Batch size 512; Learning rate 1 × 10⁻³); Mixed (Epochs 50; Batch size 256; Learning rate 1 × 10⁻³).
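
The Four Rooms distillation settings quoted in the Experiment Setup row can be written down as a small configuration, which may be easier to compare against the released code. This is a minimal sketch, not taken from the authors' repository: the class name `DistillConfig`, the dictionary name `FOUR_ROOMS_DISTILLATION`, and the keys `teacher`, `explore_go`, and `mixed` are hypothetical, and the learning rates are read as 1e-4 and 1e-3 on the assumption that the exponent signs were lost in extraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillConfig:
    """Hypothetical container for one row of the quoted Table 6."""
    epochs: int
    batch_size: int
    learning_rate: float


# Values as quoted in the Experiment Setup row (Table 6, Four Rooms).
# Key names are illustrative; exponent signs are assumed to be negative.
FOUR_ROOMS_DISTILLATION = {
    "teacher": DistillConfig(epochs=100, batch_size=64, learning_rate=1e-4),
    "explore_go": DistillConfig(epochs=50, batch_size=512, learning_rate=1e-3),
    "mixed": DistillConfig(epochs=50, batch_size=256, learning_rate=1e-3),
}

if __name__ == "__main__":
    for name, cfg in FOUR_ROOMS_DISTILLATION.items():
        print(f"{name}: epochs={cfg.epochs}, "
              f"batch_size={cfg.batch_size}, lr={cfg.learning_rate:g}")
```

The frozen dataclass keeps each setting immutable, so a run cannot silently alter a value that is meant to match the paper.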