Decision S4: Efficient Sequence-Based RL via State Spaces Layers
Authors: Shmuel Bar David, Itamar Zimerman, Eliya Nachmani, Lior Wolf
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on multiple Mujoco (Todorov et al., 2012) benchmarks and show the advantage of our method over existing off-policy methods, including the decision transformer, and over similar on-policy methods. |
| Researcher Affiliation | Collaboration | Tel Aviv University; Meta AI Research |
| Pseudocode | Yes | A sketch of our training scheme is shown in Alg. 1. For simplicity, the sketch ignores batches. ... Our on-policy fine-tuning scheme is summarized in Alg. 2. It is based on the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2015). |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluate our method on data from the Mujoco physical simulator (Todorov et al., 2012) and Ant Maze-v2 (Fu et al., 2020). ... To train the model, we used D4RL (Fu et al., 2020) datasets of recorded episodes of the environments. |
| Dataset Splits | No | The paper mentions using "batches of 32 trajectories" for training but does not describe explicit train/validation/test dataset splits. |
| Hardware Specification | Yes | All experiments take less than eight hours on a single NVIDIA RTX 2080Ti GPU, which is similar to DT. |
| Software Dependencies | No | The paper mentions optimizing the model using "Adam (Kingma & Ba, 2014)", which is an optimization algorithm, but does not specify any software libraries, frameworks, or other ancillary software components with version numbers. |
| Experiment Setup | Yes | For off-policy training we used batches of 32 trajectories, using the maximum trajectory length in that batch as the length for the entire batch, then filling shorter trajectories with blank input. The training was done with a learning rate of 10^-5 and about 10000 warm-up steps, with linear incrementation until reaching the highest rate. We optimized the model using Adam (Kingma & Ba, 2014), with a weight decay of 10^-4. ... Training for the models was done in batches of 96, with different learning rates for the critic and actor, specifically α_C = 10^-3 and α_X = 10^-5. The training occurred every K1 = 200 steps of the environment, and the target models were updated every K2 = 300 steps. (See the configuration sketches after the table.) |
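
The off-policy hyperparameters quoted above can be expressed as a minimal PyTorch sketch. Since no source code is released, this is an illustration only: the padding helper, the placeholder model, and the scheduler wiring are assumptions, while the numeric values (batch size 32, learning rate 10^-5, weight decay 10^-4, roughly 10000 linear warm-up steps) come from the quoted setup.

```python
# Hypothetical PyTorch sketch of the reported off-policy training setup.
# Only the numbers (batch of 32, lr 1e-5, weight decay 1e-4, ~10k warm-up
# steps with a linear ramp) come from the paper; everything else is assumed.
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

BATCH_SIZE = 32
BASE_LR = 1e-5
WEIGHT_DECAY = 1e-4
WARMUP_STEPS = 10_000

def collate_trajectories(trajectories):
    """Pad variable-length trajectories (a list of [T_i, D] tensors) with zeros
    ("blank input") up to the longest trajectory in the batch."""
    return pad_sequence(trajectories, batch_first=True, padding_value=0.0)

model = torch.nn.Linear(16, 8)  # placeholder for the S4-based policy network
optimizer = Adam(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

# Linear warm-up: scale the lr from ~0 up to BASE_LR over WARMUP_STEPS updates.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS))
```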
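
The on-policy fine-tuning row reports only a schedule and two learning rates; the following sketch assumes a standard actor-critic loop, since the details of Alg. 2 are not reproduced in this report. The placeholder networks and the update callbacks are hypothetical.

```python
# Hypothetical sketch of the reported DDPG-style fine-tuning schedule:
# critic lr 1e-3, actor lr 1e-5, a gradient update every K1 = 200 environment
# steps, and target-network refreshes every K2 = 300 steps. The networks and
# update routines below are placeholders, not the paper's Alg. 2.
import copy
import torch
from torch.optim import Adam

K1, K2 = 200, 300                # update frequencies quoted from the paper
LR_CRITIC, LR_ACTOR = 1e-3, 1e-5
FT_BATCH_SIZE = 96               # fine-tuning batch size quoted from the paper

actor = torch.nn.Linear(16, 4)   # placeholder for the S4-based actor
critic = torch.nn.Linear(20, 1)  # placeholder critic (state + action -> value)
target_actor = copy.deepcopy(actor)
target_critic = copy.deepcopy(critic)

actor_opt = Adam(actor.parameters(), lr=LR_ACTOR)
critic_opt = Adam(critic.parameters(), lr=LR_CRITIC)

def on_env_step(env_step, gradient_update, sync_targets):
    """Apply the reported schedule: train every K1 steps, sync targets every K2."""
    if env_step % K1 == 0:
        gradient_update(actor, critic, actor_opt, critic_opt)
    if env_step % K2 == 0:
        sync_targets(target_actor, target_critic, actor, critic)
```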