Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling
Authors: Sili Huang, Jifeng Hu, Zhejian Yang, Liwei Yang, Tao Luo, Hechang Chen, Lichao Sun, Bo Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that DM-H achieves state-of-the-art results in both long and short-term tasks, such as the D4RL, Grid World, and Tmaze benchmarks. Regarding efficiency, online testing of DM-H on the long-term task is 28 times faster than the transformer-based baselines. |
| Researcher Affiliation | Collaboration | Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education; School of Artificial Intelligence, Jilin University, China; Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; Lehigh University, Bethlehem, Pennsylvania, USA |
| Pseudocode | Yes | Algorithm 1: Decision Mamba-Hybrid. Input: a dataset of trajectories; max iterations M for the training phase; max episodes m for the testing phase; the number of trajectories n in the across-episodic context used by the Mamba model; the number of action steps c per sub-goal. Output: the generated actions. (A hedged sketch of this rollout appears after the table.) |
| Open Source Code | Yes | Source code and more hyperparameters are described in Appendix B. We provide our code at... |
| Open Datasets | Yes | Dataset: Grid World. Dataset: Tmaze. Dataset: D4RL [13] is a commonly used offline RL benchmark, including continuous control tasks. |
| Dataset Splits | No | The paper mentions "offline training" and "sampling minibatches of trajectories" but does not specify explicit train/validation/test splits by percentages or sample counts for their experiments. |
| Hardware Specification | Yes | Experiments are carried out on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs. The CPU is an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | In D4RL, Tmaze, and Large Grid World, the transformer model generates c = 20 steps of actions while the Mamba model generates one sub-goal. In the conventional Grid World, c = 5 because the task is too short. Table 3 summarizes the hyperparameters used in the DM-H model. (A hypothetical config capturing these c values appears below.) |
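
For readers reconstructing Algorithm 1 from the pseudocode row, the following is a minimal sketch of the hybrid rollout it describes: the Mamba model proposes a sub-goal from the across-episodic context, and the transformer then generates c low-level actions toward it. The `env`, `mamba_model`, and `transformer_model` interfaces are illustrative assumptions, not the authors' actual API.

```python
# Hedged sketch of the DM-H test-time rollout from Algorithm 1.
# All interfaces (env, mamba_model, transformer_model) are hypothetical
# stand-ins; the paper's actual implementation may differ.

def dmh_rollout(env, mamba_model, transformer_model, n, c, max_episodes):
    """Every c steps, Mamba proposes a sub-goal from the across-episodic
    context; the transformer then generates c low-level actions."""
    context = []  # across-episodic context: the most recent trajectories
    for _ in range(max_episodes):
        obs, trajectory, done = env.reset(), [], False
        while not done:
            # High-level step: condition Mamba on the last n trajectories
            # plus the current observation to produce one sub-goal.
            sub_goal = mamba_model(context[-n:], obs)
            # Low-level steps: the transformer emits c actions per sub-goal.
            for _ in range(c):
                action = transformer_model(trajectory, sub_goal, obs)
                obs, reward, done, _ = env.step(action)
                trajectory.append((obs, action, reward))
                if done:
                    break
        context.append(trajectory)  # grow the across-episodic context
    return context
```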
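
The c values quoted in the setup row can be captured in a small per-task config. The mapping below is a hypothetical reconstruction of those reported hyperparameters; the task keys and helper name are illustrative, not the authors' naming.

```python
# Hypothetical per-task sub-goal horizon c, reconstructed from the
# experiment-setup row above; task keys are illustrative.
SUBGOAL_HORIZON_C = {
    "d4rl": 20,              # transformer generates 20 actions per sub-goal
    "tmaze": 20,
    "large_grid_world": 20,
    "grid_world": 5,         # conventional Grid World tasks are too short for c = 20
}

def horizon_for(task: str) -> int:
    """Look up c for a task, defaulting to 20 as in the long-horizon setups."""
    return SUBGOAL_HORIZON_C.get(task, 20)
```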