Structured State Space Models for In-Context Reinforcement Learning

Authors: Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, Feryal Behbahani

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our modified architecture on a set of partially-observable environments and find that, in practice, our model outperforms RNNs while also running over five times faster. Then, by leveraging the model's ability to handle long-range sequences, we achieve strong performance on a challenging meta-learning task in which the agent is given a randomly-sampled continuous control environment, combined with a randomly-sampled linear projection of the environment's observations and actions. Furthermore, we show the resulting model can adapt to out-of-distribution held-out tasks. Overall, the results presented in this paper show that structured state space models are fast and performant for in-context reinforcement learning tasks. We provide code at https://github.com/luchris429/s5rl.
Researcher Affiliation | Collaboration | Chris Lu (FLAIR, University of Oxford); Yannick Schroecker (DeepMind); Albert Gu (DeepMind); Emilio Parisotto (DeepMind); Jakob Foerster (FLAIR, University of Oxford); Satinder Singh (DeepMind); Feryal Behbahani (DeepMind)
Pseudocode | Yes | Algorithm 1: Pseudocode for the Multi-Environment Meta-Learning environment step.
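
The paper's Algorithm 1 is not reproduced here. As a rough illustration of the task it describes, the sketch below shows how a per-task random linear projection of observations and actions might be sampled and applied, written in JAX since the released code is JAX-based. All names, shapes, and the use of Gaussian projections are illustrative assumptions, not the authors' implementation.

import jax
import jax.numpy as jnp

def sample_task_projections(key, obs_dim, act_dim, max_obs_dim, max_act_dim):
    # Sample one random linear map for observations and one for actions.
    # The paper samples such projections per task; distributions here are assumed.
    k_obs, k_act = jax.random.split(key)
    obs_proj = jax.random.normal(k_obs, (max_obs_dim, obs_dim))
    act_proj = jax.random.normal(k_act, (act_dim, max_act_dim))
    return obs_proj, act_proj

def project_observation(obs_proj, obs):
    # Map the raw environment observation into a fixed-size, randomized space.
    return obs_proj @ obs

def unproject_action(act_proj, padded_action):
    # Map the agent's action in the shared action space back to the env's space.
    return act_proj @ padded_action

# Usage sketch: a 4-dim observation / 2-dim action environment embedded into
# fixed 16-dim observation and 8-dim action spaces.
key = jax.random.PRNGKey(0)
obs_proj, act_proj = sample_task_projections(key, obs_dim=4, act_dim=2,
                                             max_obs_dim=16, max_act_dim=8)
obs = jnp.ones(4)
print(project_observation(obs_proj, obs).shape)       # (16,)
print(unproject_action(act_proj, jnp.ones(8)).shape)  # (2,)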
Open Source Code | Yes | We provide code at https://github.com/luchris429/s5rl.
Open Datasets | Yes | First, we demonstrate our modified S5's improved training speeds and performance in the extremely simple memory length environment proposed in bsuite [32]. We evaluate our S5 architecture on environments from the Partially Observable Process Gym (POPGym) [27] suite, a set of simple environments designed to benchmark memory in deep RL. We selected all of the DMControl environments and tasks that had observation and action spaces of size equal to or less than those values and split them into train and test set environments.
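
For context, the bsuite memory-length environment quoted above is loaded through bsuite's standard ID-based loader; the snippet below is a minimal illustration (the ID suffix selects a preconfigured memory length) and is not taken from the paper's code.

import bsuite

# Load one preconfigured setting of the memory_len environment (dm_env interface).
env = bsuite.load_from_id('memory_len/0')
timestep = env.reset()
while not timestep.last():
    action = 0  # placeholder policy; the task has a small discrete action space
    timestep = env.step(action)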
Dataset Splits | No | The paper discusses training and testing environments (tasks) but does not provide explicit training, validation, and test dataset splits in terms of percentages or sample counts for fixed datasets. It focuses on reinforcement learning environments where data is generated dynamically.
Hardware Specification | Yes | Runs were performed on a single NVIDIA A100. When evaluating our architecture on this suite, we show that S5 outperforms GRUs while also running over six times faster, achieving state-of-the-art results on the Repeat Hard task, which all other architectures previously struggled to solve. Note that our implementation is end-to-end compiled to run entirely on a single NVIDIA A40. These experiments were run using 64 TPUv3s.
Software Dependencies | No | The paper mentions JAX, PyTorch, Gymnax, Muesli, RLlib, Stable-Baselines3, and CleanRL, but does not provide specific version numbers for these software components as dependencies for reproducibility.
Experiment Setup | Yes | We include more discussion and the hyperparameters in Appendix B. Table 2: Hyperparameters for training A2C on Bsuite. Table 3: Hyperparameters for training PPO on POPGym. Table 4: Hyperparameters for training Muesli on Multi-Environment Meta-RL.