In-Context Reinforcement Learning for Variable Action Spaces
Authors: Viacheslav Sinii, Alexander Nikulin, Vladislav Kurenkov, Ilya Zisman, Sergey Kolesnikov
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments with Bernoulli bandits, contextual bandits, and a Darkroom environment with changing action spaces, we demonstrate that Headless-AD matches the performance of the original data-generation algorithm and scales to action spaces up to 5x larger than those seen during training. We also observed that Headless-AD can even outperform AD when both are trained on the same action space, especially when evaluated on larger action sets. |
| Researcher Affiliation | Collaboration | Tinkoff, Moscow, Russia; Innopolis University; AIRI, Moscow, Russia; MIPT. |
| Pseudocode | Yes | Listing 1: Code that demonstrates the Headless-AD training procedure. Note that this snippet is intended for illustration purposes only; the complete code can be found in Headless-AD's repository. (A hedged sketch of the training step appears below the table.) |
| Open Source Code | Yes | Implementation is available at: https://github.com/corl-team/headless-ad. |
| Open Datasets | No | The paper uses environments such as Bernoulli Bandit, Contextual Bandit, and Darkroom, and describes how the training data was generated (e.g., with Thompson Sampling or LinUCB). However, it does not provide concrete access information (a URL or formal citation) for any pre-existing public dataset; the data is generated for the experiments rather than sourced externally. (A sketch of Thompson Sampling data generation appears below the table.) |
| Dataset Splits | No | The paper mentions 'The training dataset consisted of bandits with 4–20 arms', and discusses training and test distributions, but does not explicitly provide percentages or counts for training, validation, and test splits for reproducibility. |
| Hardware Specification | Yes | All experiments were performed on A100 GPUs. |
| Software Dependencies | Yes | To sample the orthonormal vectors used as action embeddings, we use the torch.nn.init.orthogonal_ function from PyTorch (Paszke et al., 2019). (See the embedding sketch below the table.) |
| Experiment Setup | Yes | In our experiments, we used the Tiny LLaMA (Zhang et al., 2024) implementation of the transformer model and the AdamW optimizer (Loshchilov & Hutter, 2017). All environment-specific hyperparameters are listed in Appendix J, Table 3 (Headless-AD's Environment-Specific Hyperparameters). For certain instances, hyperparameters were optimized within the ranges specified in the Sweep Values column, using the Bayesian search method of the wandb sweep tool (Biewald, 2020). (A sweep sketch appears below the table.) |
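
For orientation, here is a minimal sketch of the training step that Listing 1 in the paper illustrates, written from the paper's high-level description rather than copied from the repository. It assumes the core Headless-AD idea: the transformer has no discrete action head; it predicts an action embedding, which is scored against randomly sampled orthonormal per-action embeddings, with a contrastive cross-entropy loss over the similarities. All names (`model`, `embed_dim`, the `model(...)` signature) are illustrative, not the repository's API.

```python
import torch
import torch.nn.functional as F

def headless_ad_step(model, states, actions, rewards, num_actions, embed_dim):
    """One hedged training step: no action head; score predicted embeddings
    against freshly sampled orthonormal per-action embeddings."""
    # Resample orthonormal action embeddings each step so no fixed action
    # identity is memorized (orthonormal rows require num_actions <= embed_dim).
    action_emb = torch.empty(num_actions, embed_dim, device=states.device)
    torch.nn.init.orthogonal_(action_emb)

    # Encode the in-context history; prior actions enter as their embeddings.
    context = model(states, action_emb[actions], rewards)  # (B, T, embed_dim)

    # Similarity of each predicted embedding to every action's embedding
    # serves as the logits for a contrastive cross-entropy loss.
    logits = context @ action_emb.T                        # (B, T, num_actions)
    loss = F.cross_entropy(logits.flatten(0, 1), actions.flatten())
    return loss
```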
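
The "Open Datasets" row notes that data is generated rather than downloaded. As a hedged illustration of one such generator, here is a minimal Thompson Sampling loop for a Bernoulli bandit with Beta(1, 1) priors; the arm count, priors, and trajectory format are assumptions, not the paper's exact pipeline.

```python
import numpy as np

def thompson_sampling_trajectory(arm_probs, num_steps, seed=0):
    """Generate (action, reward) pairs from Thompson Sampling on a
    Bernoulli bandit with a Beta(1, 1) prior on every arm."""
    rng = np.random.default_rng(seed)
    k = len(arm_probs)
    alpha, beta = np.ones(k), np.ones(k)  # Beta posterior parameters
    actions, rewards = [], []
    for _ in range(num_steps):
        theta = rng.beta(alpha, beta)     # sample a plausible mean per arm
        a = int(np.argmax(theta))         # act greedily w.r.t. the sample
        r = float(rng.random() < arm_probs[a])
        alpha[a] += r                     # posterior update on the pulled arm
        beta[a] += 1.0 - r
        actions.append(a)
        rewards.append(r)
    return np.array(actions), np.array(rewards)

# Example: a 10-armed bandit, within the paper's 4-20 arm training range.
actions, rewards = thompson_sampling_trajectory(np.linspace(0.1, 0.9, 10), 300)
```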
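
The "Software Dependencies" row refers to the in-place PyTorch initializer `torch.nn.init.orthogonal_` (the underscore-free alias was deprecated). A self-contained sketch of how orthonormal action embeddings could be sampled and sanity-checked:

```python
import torch

def sample_action_embeddings(num_actions: int, embed_dim: int) -> torch.Tensor:
    # orthogonal_ fills the tensor with a (semi-)orthogonal matrix; the rows
    # are mutually orthonormal as long as num_actions <= embed_dim.
    emb = torch.empty(num_actions, embed_dim)
    torch.nn.init.orthogonal_(emb)
    return emb

emb = sample_action_embeddings(num_actions=10, embed_dim=64)
# The Gram matrix of orthonormal rows is the identity (up to float error).
assert torch.allclose(emb @ emb.T, torch.eye(10), atol=1e-5)
```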
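
The "Experiment Setup" row mentions Bayesian hyperparameter search via wandb sweeps. Below is a minimal sketch of how such a sweep can be declared with the public wandb API; the metric name, parameter names, ranges, and project name are placeholders, not the paper's actual sweep configuration.

```python
import wandb

# Hypothetical sweep: Bayesian search over learning rate and dropout.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "eval/return", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-3},
        "dropout": {"values": [0.0, 0.1, 0.2]},
    },
}

def train():
    run = wandb.init()
    # ... build the model from run.config.learning_rate / run.config.dropout,
    # train it, and log the metric the sweep optimizes:
    wandb.log({"eval/return": 0.0})  # placeholder value

sweep_id = wandb.sweep(sweep=sweep_config, project="headless-ad-repro")
wandb.agent(sweep_id, function=train, count=20)
```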