In-context Exploration-Exploitation for Reinforcement Learning

Authors: Zhenwen Dai, Federico Tomasi, Sina Ghiassian

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method."
Researcher Affiliation | Industry | Zhenwen Dai, Federico Tomasi, Sina Ghiassian; Spotify Research; {zhenwend,federicot,sinag}@spotify.com
Pseudocode | Yes | "Algorithm 1: In-context Exploration-Exploitation (ICEE) Action Inference"
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the ICEE method.
Open Datasets | Yes | "We use the two grid world environments in (Lee et al., 2022): dark room and dark key-to-door." (Section 7); "The list of 2D functions used for evaluations are: Branin, Beale, Bohachevsky, Bukin6, De Jong5, Drop Wave, Eggholder, Goldstein Price, Holder Table, Kim1, Kim2, Kim3, Michalewicz, Shubert, Six Hump Camel, Three Hump Camel." (Appendix D)
Dataset Splits | No | The paper mentions "offline training" and "evaluation" and gives the number of episodes per sequence, but it does not explicitly specify a distinct validation set or its size/proportion.
Hardware Specification | Yes | "All the methods run on a single A100 GPU."
Software Dependencies | No | "ICEE is implemented based on nanoGPT."; "ICEE contains 12 layers with 128-dimensional embeddings. There are 4 heads in the multi-head attention. We use the Adam optimizer with the learning rate 10^-5." (Appendix C); "The expected improvement baseline is implemented using BOTorch (Balandat et al., 2020)." (Appendix D). Neither nanoGPT nor BOTorch has an explicit version number stated for reproducibility. (An illustrative configuration sketch follows the table.)
Experiment Setup | Yes | "ICEE contains 12 layers with 128-dimensional embeddings. There are 4 heads in the multi-head attention. We use the Adam optimizer with the learning rate 10^-5." (Appendix C); "To encourage ICEE to solve the games quickly, when calculating the in-episode return-to-go, a negative reward, -1/T, is given to each step that does not receive a reward, where T is the episode length." (Section 7). (A return-to-go sketch under this shaping follows the table.)
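
As referenced in the Software Dependencies row, the following is a minimal sketch (not the authors' code, which is not released) of the Appendix C settings: 12 transformer layers, 128-dimensional embeddings, 4 attention heads, and Adam with learning rate 10^-5. The class and function names, and any field not quoted above (e.g. block_size), are assumptions for illustration only.

```python
# Minimal sketch, not the authors' implementation: the Appendix C
# hyperparameters expressed as a nanoGPT-style config plus an Adam optimizer.
# Names and the block_size value are assumptions; only the quoted numbers
# (12 layers, 128-dim embeddings, 4 heads, lr 10^-5) come from the paper.
from dataclasses import dataclass

import torch


@dataclass
class ICEEConfig:
    n_layer: int = 12        # "ICEE contains 12 layers"
    n_embd: int = 128        # "128 dimensional embeddings"
    n_head: int = 4          # "4 heads in the multi-head attention"
    block_size: int = 1024   # context length: not stated in the quoted text


def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # "We use the Adam optimizer with the learning rate 10^-5."
    return torch.optim.Adam(model.parameters(), lr=1e-5)
```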
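
As referenced in the Experiment Setup row, below is a minimal sketch of the Section 7 reward shaping, assuming unrewarded steps carry a reward of zero: each such step is assigned -1/T (T = episode length) before the in-episode return-to-go is computed. The function name and interface are illustrative, not taken from the paper.

```python
# Illustrative sketch of the Section 7 reward shaping and return-to-go.
# Assumption: steps with no environment reward appear as 0 in `rewards`.
from typing import List


def shaped_return_to_go(rewards: List[float]) -> List[float]:
    T = len(rewards)
    # Replace zero rewards with the -1/T penalty described in the paper.
    shaped = [r if r != 0 else -1.0 / T for r in rewards]
    # Return-to-go at step t is the sum of shaped rewards from t to the end.
    rtg, running = [], 0.0
    for r in reversed(shaped):
        running += r
        rtg.append(running)
    return rtg[::-1]


# Example: a 5-step episode where only the last step is rewarded.
print(shaped_return_to_go([0, 0, 0, 0, 1]))
# ~[0.2, 0.4, 0.6, 0.8, 1.0] (up to floating-point rounding)
```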