Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning
Authors: Max Weltevrede, Moritz Zanger, Matthijs Spaan, Wendelin Boehmer
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent. |
| Researcher Affiliation | Academia | Max Weltevrede Delft University of Technology Delft, The Netherlands EMAIL Moritz A. Zanger Delft University of Technology Delft, The Netherlands EMAIL Matthijs T. J. Spaan Delft University of Technology Delft, The Netherlands EMAIL Wendelin Böhmer Delft University of Technology Delft, The Netherlands EMAIL |
| Pseudocode | No | The paper describes algorithms and methods in prose but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for all the experiments in the main text can be found at https://github.com/MWeltevrede/distillation-after-training. |
| Open Datasets | Yes | In Section 5.2, we demonstrate that the insights also apply to the more complex Minigrid Four Rooms environment (Chevalier-Boisvert et al., 2023) that breaks most of the assumptions required for the proof in Section 4. For experimental details, see Appendix C. |
| Dataset Splits | Yes | We used separate validation and testing sets consisting of unseen contexts of size 40 and 200 respectively. The validation set was used for algorithm development and hyperparameter tuning, and the test set was only used as a final evaluation (and is reported in Table 3). |
| Hardware Specification | Yes | The Four Room experiments were executed on a computer with an NVIDIA RTX 3070 GPU, Intel Core i7 12700 CPU and 32 GB of memory. |
| Software Dependencies | No | The paper mentions using Stable-Baselines3 but does not provide a specific version number. It does not list other software dependencies with versions. |
| Experiment Setup | Yes | Table 4: Hyper-parameters used for the Reacher with rotational symmetry CMDP experiments. Epochs: 500; Batch size: 6; Learning rate: $1 \times 10^{-4}$. Table 6: Hyperparameters used for policy distillation in the Four Rooms environment. Teacher: Epochs 100, Batch size 64, Learning rate $1 \times 10^{-4}$. Explore-Go: Epochs 50, Batch size 512, Learning rate $1 \times 10^{-3}$. Mixed: Epochs 50, Batch size 256, Learning rate $1 \times 10^{-3}$. |
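
For readers who want the reported Four Rooms distillation settings in machine-readable form, the following is a minimal sketch that transcribes the Table 6 values and the held-out context counts quoted above. The class name `DistillationConfig`, the dictionary `FOUR_ROOMS_DISTILLATION`, and the helper `describe` are hypothetical names chosen for illustration and are not taken from the authors' repository. The learning rates are assumed to be 1e-4 and 1e-3 (negative exponents).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillationConfig:
    """Hypothetical container for one group of reported hyper-parameters."""
    epochs: int
    batch_size: int
    learning_rate: float


# Values transcribed from the "Experiment Setup" row (Table 6, Four Rooms).
# Assumption: learning rates are 1e-4 and 1e-3, i.e. negative exponents.
FOUR_ROOMS_DISTILLATION = {
    "Teacher": DistillationConfig(epochs=100, batch_size=64, learning_rate=1e-4),
    "Explore-Go": DistillationConfig(epochs=50, batch_size=512, learning_rate=1e-3),
    "Mixed": DistillationConfig(epochs=50, batch_size=256, learning_rate=1e-3),
}

# Held-out context counts quoted in the "Dataset Splits" row.
NUM_VALIDATION_CONTEXTS = 40
NUM_TEST_CONTEXTS = 200


def describe(name: str) -> str:
    """Return a one-line summary of the named configuration."""
    cfg = FOUR_ROOMS_DISTILLATION[name]
    return (
        f"{name}: epochs={cfg.epochs}, "
        f"batch_size={cfg.batch_size}, lr={cfg.learning_rate:g}"
    )


if __name__ == "__main__":
    for key in FOUR_ROOMS_DISTILLATION:
        print(describe(key))
```

Keeping the settings in a frozen dataclass makes accidental mutation during a run an error, which can help when comparing results against the reported numbers.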