In-context Exploration-Exploitation for Reinforcement Learning
Authors: Zhenwen Dai, Federico Tomasi, Sina Ghiassian
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method. |
| Researcher Affiliation | Industry | Zhenwen Dai, Federico Tomasi, Sina Ghiassian Spotify Research {zhenwend,federicot,sinag}@spotify.com |
| Pseudocode | Yes | Algorithm 1: In-context Exploration-Exploitation (ICEE) Action Inference |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the ICEE methodology described in the paper. |
| Open Datasets | Yes | We use the two grid world environments in (Lee et al., 2022): dark room and dark key-to-door. (Section 7); The list of 2D functions used for evaluations are: Branin, Beale, Bohachevsky, Bukin6, De Jong5, Drop Wave, Eggholder, Goldstein Price, Holder Table, Kim1, Kim2, Kim3, Michalewicz, Shubert, Six Hump Camel, Three Hump Camel. (Appendix D) |
| Dataset Splits | No | The paper mentions 'offline training' and 'evaluation', and provides details on the number of episodes per sequence, but does not explicitly specify a distinct validation set or its size/proportion. |
| Hardware Specification | Yes | All the methods run on a single A100 GPU. |
| Software Dependencies | No | ICEE is implemented based on nanoGPT. For the RL experiments, ICEE contains 12 layers with 128-dimensional embeddings. There are 4 heads in the multi-head attention. We use the Adam optimizer with the learning rate 10⁻⁵. (Appendix C); The expected improvement baseline is implemented using BoTorch (Balandat et al., 2020). (Appendix D). Neither nanoGPT nor BoTorch has an explicit version number stated for reproducibility. |
| Experiment Setup | Yes | ICEE contains 12 layers with 128-dimensional embeddings. There are 4 heads in the multi-head attention. We use the Adam optimizer with the learning rate 10⁻⁵. (Appendix C); To encourage ICEE to solve the games quickly, when calculating the in-episode return-to-go, a negative reward, −1/T, is given to each step that does not receive a reward, where T is the episode length. (Section 7) |
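The reward-shaping rule quoted in the Experiment Setup row can be made concrete with a short sketch. This is not the authors' code; it is a minimal illustration, assuming sparse 0/1 rewards, of how the in-episode return-to-go would look once each non-rewarding step is penalized by −1/T (T being the episode length).

```python
def shaped_return_to_go(rewards):
    """Return-to-go after the paper's reward shaping: each step with
    zero reward is replaced by -1/T, where T is the episode length."""
    T = len(rewards)
    shaped = [r if r != 0 else -1.0 / T for r in rewards]
    # Return-to-go at step t is the sum of shaped rewards from t to T-1,
    # computed by accumulating from the end of the episode backwards.
    rtg, total = [], 0.0
    for r in reversed(shaped):
        total += r
        rtg.append(total)
    return rtg[::-1]

# Example: a length-4 episode that only succeeds on its final step.
print(shaped_return_to_go([0.0, 0.0, 0.0, 1.0]))
# -> [0.25, 0.5, 0.75, 1.0]
```

Under this shaping, earlier steps of a slow episode carry a lower return-to-go than the same steps of a fast one, which is what pushes the in-context learner toward solving the game quickly.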