Emergence of In-Context Reinforcement Learning from Noise Distillation

Authors: Ilya Zisman, Vladislav Kurenkov, Alexander Nikulin, Viacheslav Sinii, Sergey Kolesnikov

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that it is viable to construct a synthetic noise injection curriculum which helps to obtain learning histories. Moreover, we experimentally demonstrate that it is possible to alleviate the need for generation using optimal policies, with in-context RL still able to outperform the best suboptimal policy in a learning dataset by a 2x margin.
Researcher Affiliation | Collaboration | AIRI, Moscow, Russia; Skoltech, Moscow, Russia; Innopolis University, Kazan, Russia; MIPT, Moscow, Russia; Tinkoff, Moscow, Russia. *Work done while at Tinkoff.
Pseudocode | Yes | Algorithm 1: Data Generation (a hedged sketch of this procedure follows the table).
Open Source Code | Yes | Our implementation is available at https://github.com/corl-team/ad-eps
Open Datasets | No | The paper describes generating data within custom environments (Dark Room, Key-to-Door, Watermaze) and does not provide concrete access information (link, DOI, formal citation) for a publicly available or open dataset used for training.
Dataset Splits | Yes | In total, there are 81 goals, of which we use 65 for training and 16 for evaluation. We employ a Gtrain, Geval, Gtest task split: Gtrain and Geval are used during the pre-training phase to select the best model, and Gtest during evaluation. (A split helper is sketched after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions software packages such as the CORL package, stable-baselines3, DMLab, and Shimmy, but does not provide specific version numbers for the dependencies required for replication.
Experiment Setup | Yes | The exact hyperparameters of the model can be found in Appendix F. We compute the decay rate by regulating how many histories (full trajectories until termination) are generated for a single goal. The exact number of histories is reported in Appendix G.
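The noise-injection curriculum referenced in the Research Type, Pseudocode, and Experiment Setup rows can be summarized as follows. This is a minimal sketch, not the authors' implementation: it assumes a Gymnasium-style environment, a `demonstrator` callable mapping observations to actions, and a linear epsilon decay whose rate is set by the number of histories generated per goal; the function and variable names are illustrative.

```python
import numpy as np


def generate_noisy_histories(env, demonstrator, num_histories, seed=0):
    """Sketch of a decaying noise-injection curriculum for one goal.

    Epsilon falls linearly from 1 to 0 across histories, so the number of
    histories per goal implicitly sets the decay rate (the quantity the
    Experiment Setup row refers to). With probability epsilon the
    demonstrator's action is replaced by a random one, making the
    concatenated trajectories resemble a policy improving over time.
    """
    rng = np.random.default_rng(seed)
    histories = []
    for k in range(num_histories):
        eps = 1.0 - k / max(num_histories - 1, 1)  # decay set by num_histories
        obs, _ = env.reset()
        trajectory, done = [], False
        while not done:
            if rng.random() < eps:
                action = env.action_space.sample()  # injected noise
            else:
                action = demonstrator(obs)          # demonstrator's action
            next_obs, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((obs, action, reward))
            obs = next_obs
            done = terminated or truncated
        histories.append(trajectory)
    return histories
```

With many histories per goal, the earliest trajectories are close to uniformly random and the latest are close to the demonstrator, which is the sense in which regulating the number of histories controls the decay rate.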
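The Gtrain/Geval/Gtest split quoted in the Dataset Splits row could be produced with a simple seeded partition of the goal set. The helper below is an assumption for illustration only: the 81 and 65/16 counts come from the quote, while the seeded shuffle, the function name, and the omission of Gtest handling (whose relation to the 65/16 partition is not specified in the quote) are not taken from the paper.

```python
import numpy as np


def split_goals(goals, n_train=65, n_eval=16, seed=0):
    """Partition a list of goals into Gtrain and Geval subsets.

    Assumed procedure: a seeded shuffle followed by slicing; the counts
    mirror the quoted 65/16 split of 81 goals.
    """
    perm = np.random.default_rng(seed).permutation(len(goals))
    g_train = [goals[i] for i in perm[:n_train]]
    g_eval = [goals[i] for i in perm[n_train:n_train + n_eval]]
    return g_train, g_eval


# Example usage, assuming the 81 goals form a 9x9 grid of cells:
g_train, g_eval = split_goals([(x, y) for x in range(9) for y in range(9)])
```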