Decision S4: Efficient Sequence-Based RL via State Spaces Layers

Authors: Shmuel Bar David, Itamar Zimerman, Eliya Nachmani, Lior Wolf

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on multiple Mujoco (Todorov et al., 2012) benchmarks and show the advantage of our method over existing off-policy methods, including the decision transformer, and over similar on-policy methods."
Researcher Affiliation | Collaboration | Tel Aviv University; Meta AI Research
Pseudocode | Yes | "A sketch of our training scheme is shown in Alg. 1. For simplicity, the sketch ignores batches. ... Our on-policy fine-tuning scheme is summarized in Alg. 2. It is based on the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015)." (A DDPG-style sketch follows the table.)
Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "We evaluate our method on data from the Mujoco physical simulator (Todorov et al., 2012) and Ant Maze-v2 (Fu et al., 2020). ... To train the model, we used D4RL (Fu et al., 2020) datasets of recorded episodes of the environments." (A D4RL loading example follows the table.)
Dataset Splits | No | The paper mentions using "batches of 32 trajectories" for training and discusses the training procedure, but it does not specify explicit train/validation/test dataset splits.
Hardware Specification | Yes | "All experiments take less than eight hours on a single NVIDIA RTX 2080Ti GPU, which is similar to DT."
Software Dependencies | No | The paper mentions optimizing the model using "Adam Kingma & Ba (2014)", which is an optimization algorithm, but does not specify any software libraries, frameworks, or other ancillary software components with version numbers.
Experiment Setup | Yes | "For off-policy training we used batches of 32 trajectories, using the maximum trajectory length in that batch as the length for the entire batch, then filling shorter trajectories with blank input. The training was done with a learning rate of 10^-5 and about 10000 warm-up steps, with linear incrementation until reaching the highest rate. We optimized the model using Adam Kingma & Ba (2014), with a weight decay of 10^-4. ... Training for the models was done in batches of 96, with different learning rates for the critic and actor, specifically αC = 10^-3 and αX = 10^-5. The training occurred every K1 = 200 steps of the environment, and the target models were updated every K2 = 300 steps." (A configuration sketch follows the table.)
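
The Pseudocode row notes that the paper's on-policy fine-tuning scheme (Alg. 2) is based on DDPG (Lillicrap et al., 2015). The paper's algorithm listings are not reproduced here; the snippet below is only a minimal, generic DDPG-style actor-critic update in PyTorch, with a placeholder MLP standing in for the S4-based sequence model. All names, network sizes, GAMMA, and TAU are illustrative assumptions, not the authors' implementation.

```python
# Generic DDPG-style update step (sketch). Batch tensors are assumed to be
# s, s2: (B, obs_dim), a: (B, act_dim), r, done: (B, 1).
import copy
import torch
import torch.nn as nn

GAMMA, TAU = 0.99, 0.005          # discount and soft-update rate (assumed values)
ALPHA_C, ALPHA_X = 1e-3, 1e-5     # critic / actor learning rates quoted in the table
obs_dim, act_dim = 17, 6          # illustrative MuJoCo-sized dimensions

actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=ALPHA_X)
critic_opt = torch.optim.Adam(critic.parameters(), lr=ALPHA_C)


def ddpg_update(batch):
    """One DDPG-style update on a batch of (s, a, r, s2, done) tensors."""
    s, a, r, s2, done = batch

    # Critic: regress Q(s, a) toward the bootstrapped one-step target.
    with torch.no_grad():
        q_next = critic_target(torch.cat([s2, actor_target(s2)], dim=-1))
        target = r + GAMMA * (1.0 - done) * q_next
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize the critic's value of the actor's own actions.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates; the paper instead syncs targets every K2 = 300 steps
    # and runs an update every K1 = 200 environment steps.
    for net, tgt in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```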
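The Open Datasets row points to D4RL (Fu et al., 2020) recordings of MuJoCo environments. As a hedged illustration of how such datasets are typically loaded (not the authors' data pipeline, and the environment name is only an example), D4RL exposes them through Gym environment wrappers:

```python
# Loading a D4RL MuJoCo dataset (requires the `gym` and `d4rl` packages).
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("hopper-medium-v2")   # example dataset name, not necessarily the one used
dataset = env.get_dataset()          # dict of numpy arrays with the recorded episodes

print(dataset["observations"].shape,  # (N, obs_dim)
      dataset["actions"].shape,       # (N, act_dim)
      dataset["rewards"].shape,       # (N,)
      dataset["terminals"].shape)     # (N,) episode-termination flags
```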
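The Experiment Setup row quotes concrete off-policy training hyperparameters: Adam at a learning rate of 10^-5 with weight decay 10^-4, about 10,000 linear warm-up steps, and batches of 32 trajectories padded to the longest trajectory in the batch. The configuration sketch below mirrors those numbers in PyTorch; the model is a placeholder and the zero padding value is an assumption, not the paper's code.

```python
# Off-policy training configuration sketch matching the quoted hyperparameters.
import torch
from torch.nn.utils.rnn import pad_sequence

LR, WEIGHT_DECAY, WARMUP_STEPS = 1e-5, 1e-4, 10_000
BATCH_SIZE = 32

model = torch.nn.Linear(17, 6)  # placeholder for the S4-based policy network
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

# Linear warm-up: scale the learning rate from ~0 up to LR over WARMUP_STEPS updates.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS))

def collate_trajectories(trajectories):
    """Pad a list of (T_i, feature_dim) trajectory tensors to the longest one,
    filling shorter trajectories with zeros ("blank input")."""
    return pad_sequence(trajectories, batch_first=True, padding_value=0.0)

# Per training step (loss computation omitted):
#   batch = collate_trajectories(sampled_trajectories)   # (32, T_max, feature_dim)
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```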