Emergence of In-Context Reinforcement Learning from Noise Distillation
Authors: Ilya Zisman, Vladislav Kurenkov, Alexander Nikulin, Viacheslav Sinii, Sergey Kolesnikov
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that it is viable to construct a synthetic noise injection curriculum which helps to obtain learning histories. Moreover, we experimentally demonstrate that it is possible to alleviate the need for generation using optimal policies, with in-context RL still able to outperform the best suboptimal policy in a learning dataset by a 2x margin. |
| Researcher Affiliation | Collaboration | 1 AIRI, Moscow, Russia; 2 Skoltech, Moscow, Russia; 3 Innopolis University, Kazan, Russia; 4 MIPT, Moscow, Russia; 5 Tinkoff, Moscow, Russia. *Work done while at Tinkoff |
| Pseudocode | Yes | Algorithm 1 Data Generation |
| Open Source Code | Yes | Our implementation is available at https://github.com/corl-team/ad-eps |
| Open Datasets | No | The paper describes generating data within custom environments (Dark Room, Key-to-Door, Watermaze) and does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset used for training. |
| Dataset Splits | Yes | In total, there are 81 goals, of which we use 65 for training and 16 for evaluation. We employ a Gtrain, Geval, Gtest task split: Gtrain and Geval are used during the pre-training phase to select the best model, and Gtest during evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions software packages like CORL package, stable-baselines3, DMLab, and Shimmy package, but does not provide specific version numbers for these software dependencies required for replication. |
| Experiment Setup | Yes | The exact hyperparameters of the model can be found in Appendix F. We compute the decay rate by regulating how many histories (full trajectories until termination) are generated for a single goal. The exact number of histories is reported in Appendix G. |
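
The "Pseudocode" and "Experiment Setup" rows above refer to the paper's Algorithm 1 (Data Generation), which constructs learning histories by injecting noise into a demonstrator policy and decaying that noise over the episodes generated for each goal. The sketch below illustrates the idea under stated assumptions: it uses the classic Gym API, a linear decay schedule, and hypothetical names (`env`, `demonstrator`, `num_histories`); it is not the authors' implementation.

```python
# Minimal sketch of a noise-injection data-generation loop in the spirit of
# Algorithm 1 (Data Generation). Assumes the classic Gym API
# (reset() -> obs, step() -> (obs, reward, done, info)).
import random


def generate_learning_history(env, demonstrator, num_histories):
    """Generate `num_histories` episodes for a single goal, linearly decaying
    the probability of a random action from 1.0 to 0.0 so the concatenated
    history resembles a policy that improves over time."""
    history = []
    for i in range(num_histories):
        # The decay rate is set implicitly by how many histories are generated
        # per goal: more histories means a slower decay (cf. the Experiment
        # Setup row above).
        epsilon = 1.0 - i / max(num_histories - 1, 1)
        obs, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = env.action_space.sample()  # exploratory noise step
            else:
                action = demonstrator(obs)  # (sub)optimal demonstrator action
            next_obs, reward, done, _ = env.step(action)
            history.append((obs, action, reward))
            obs = next_obs
    return history
```

Histories generated this way can then serve as pre-training data for the in-context RL model, which, as the paper reports, can outperform the best suboptimal policy in the dataset.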