Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Emergence of In-Context Reinforcement Learning from Noise Distillation
Authors: Ilya Zisman, Vladislav Kurenkov, Alexander Nikulin, Viacheslav Sinii, Sergey Kolesnikov
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that it is viable to construct a synthetic noise injection curriculum which helps to obtain learning histories. Moreover, we experimentally demonstrate that it is possible to alleviate the need for generation using optimal policies, with in-context RL still able to outperform the best suboptimal policy in a learning dataset by a 2x margin. |
| Researcher Affiliation | Collaboration | 1AIRI, Moscow, Russia 2Skoltech, Moscow, Russia 3Innopolis University, Kazan, Russia 4MIPT, Moscow, Russia 5Tinkoff, Moscow, Russia. *Work done while at Tinkoff |
| Pseudocode | Yes | Algorithm 1 Data Generation |
| Open Source Code | Yes | Our implementation is available at https://github.com/corl-team/ad-eps |
| Open Datasets | No | The paper describes generating data within custom environments (Dark Room, Key-to-Door, Watermaze) and does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset used for training. |
| Dataset Splits | Yes | In total, there are 81 goals, of which we use 65 for training and 16 for evaluation. We employ G_train, G_eval, G_test task split, G_train, G_eval during the pre-training phase to select the best model, G_test during evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions software packages like CORL package, stable-baselines3, DMLab, and Shimmy package, but does not provide specific version numbers for these software dependencies required for replication. |
| Experiment Setup | Yes | The exact hyperparameters of the model can be found in Appendix F. We compute the decay rate by regulating how many histories (full trajectories until termination) are generated for a single goal. The exact number of histories is reported in Appendix G. |
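
The Dataset Splits row quotes 81 goals divided into 65 for training and 16 for evaluation. The snippet below is a minimal sketch of such a disjoint goal split, not the authors' code: it assumes the 81 goals are the cells of a 9x9 grid and that the split is a seeded random shuffle, neither of which is stated in the rows above. The name `split_goals` and its parameters are hypothetical.

```python
import random


def split_goals(grid_size=9, n_train=65, seed=0):
    """Split grid-cell goals into disjoint training and evaluation sets.

    Assumptions (not taken from the paper): goals are the cells of a
    grid_size x grid_size grid, and the split is a seeded random shuffle.
    """
    # Enumerate every cell of the grid as a candidate goal: 9 * 9 = 81.
    goals = [(x, y) for x in range(grid_size) for y in range(grid_size)]

    # A dedicated, seeded generator keeps the split reproducible.
    rng = random.Random(seed)
    rng.shuffle(goals)

    # The first n_train goals are for training, the remainder for evaluation.
    return goals[:n_train], goals[n_train:]


train_goals, eval_goals = split_goals()
print(len(train_goals), len(eval_goals))  # 65 16
```

Fixing the seed means the same goals land in each set on every run, which is what lets a reported split be checked by someone else.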