In-context Exploration-Exploitation for Reinforcement Learning

Authors: Zhenwen Dai, Federico Tomasi, Sina Ghiassian

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through experiments in grid world environments, we demonstrate that ICEE can learn to solve new RL tasks using only tens of episodes, marking a substantial improvement over the hundreds of episodes needed by the previous in-context learning method."
Researcher Affiliation | Industry | Zhenwen Dai, Federico Tomasi, Sina Ghiassian; Spotify Research; {zhenwend,federicot,sinag}@spotify.com
Pseudocode | Yes | "Algorithm 1: In-context Exploration-Exploitation (ICEE) Action Inference"
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the ICEE method.
Open Datasets | Yes | "We use the two grid world environments in (Lee et al., 2022): dark room and dark key-to-door." (Section 7); "The list of 2D functions used for evaluations are: Branin, Beale, Bohachevsky, Bukin6, De Jong5, Drop Wave, Eggholder, Goldstein Price, Holder Table, Kim1, Kim2, Kim3, Michalewicz, Shubert, Six Hump Camel, Three Hump Camel." (Appendix D)
Dataset Splits | No | The paper mentions "offline training" and "evaluation" and gives the number of episodes per sequence, but it does not explicitly specify a distinct validation set or its size/proportion.
Hardware Specification | Yes | "All the methods run on a single A100 GPU."
Software Dependencies | No | "ICEE is implemented based on nanoGPT."; "ICEE contains 12 layers with 128-dimensional embeddings. There are 4 heads in the multi-head attention. We use the Adam optimizer with the learning rate 10^-5." (Appendix C); "The expected improvement baseline is implemented using BOTorch (Balandat et al., 2020)." (Appendix D). Neither nanoGPT nor BOTorch has an explicit version number stated for reproducibility. (An illustrative configuration sketch follows the table.)
Experiment Setup | Yes | "ICEE contains 12 layers with 128-dimensional embeddings. There are 4 heads in the multi-head attention. We use the Adam optimizer with the learning rate 10^-5." (Appendix C); "To encourage ICEE to solve the games quickly, when calculating the in-episode return-to-go, a negative reward, -1/T, is given to each step that does not receive a reward, where T is the episode length." (Section 7). (A return-to-go sketch under this shaping follows the table.)
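
As referenced in the Software Dependencies row, the following is a minimal sketch (not the authors' code, which is not released) of the Appendix C settings: 12 transformer layers, 128-dimensional embeddings, 4 attention heads, and Adam with learning rate 10^-5. The class and function names, and any field not quoted above (e.g. block_size), are assumptions for illustration only.

```python
# Minimal sketch, not the authors' implementation: the Appendix C
# hyperparameters expressed as a nanoGPT-style config plus an Adam optimizer.
# Names and the block_size value are assumptions; only the quoted numbers
# (12 layers, 128-dim embeddings, 4 heads, lr 10^-5) come from the paper.
from dataclasses import dataclass

import torch


@dataclass
class ICEEConfig:
    n_layer: int = 12        # "ICEE contains 12 layers"
    n_embd: int = 128        # "128 dimensional embeddings"
    n_head: int = 4          # "4 heads in the multi-head attention"
    block_size: int = 1024   # context length: not stated in the quoted text


def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # "We use the Adam optimizer with the learning rate 10^-5."
    return torch.optim.Adam(model.parameters(), lr=1e-5)
```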
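
As referenced in the Experiment Setup row, below is a minimal sketch of the Section 7 reward shaping, assuming unrewarded steps carry a reward of zero: each such step is assigned -1/T (T = episode length) before the in-episode return-to-go is computed. The function name and interface are illustrative, not taken from the paper.

```python
# Illustrative sketch of the Section 7 reward shaping and return-to-go.
# Assumption: steps with no environment reward appear as 0 in `rewards`.
from typing import List


def shaped_return_to_go(rewards: List[float]) -> List[float]:
    T = len(rewards)
    # Replace zero rewards with the -1/T penalty described in the paper.
    shaped = [r if r != 0 else -1.0 / T for r in rewards]
    # Return-to-go at step t is the sum of shaped rewards from t to the end.
    rtg, running = [], 0.0
    for r in reversed(shaped):
        running += r
        rtg.append(running)
    return rtg[::-1]


# Example: a 5-step episode where only the last step is rewarded.
print(shaped_return_to_go([0, 0, 0, 0, 1]))
# ~[0.2, 0.4, 0.6, 0.8, 1.0] (up to floating-point rounding)
```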