Decision Mamba: Reinforcement Learning via Hybrid Selective Sequence Modeling

Authors: Sili Huang, Jifeng Hu, Zhejian Yang, Liwei Yang, Tao Luo, Hechang Chen, Lichao Sun, Bo Yang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that DM-H achieves state-of-the-art performance in both long- and short-term tasks, such as the D4RL, Grid World, and Tmaze benchmarks. Regarding efficiency, online testing of DM-H in the long-term task is 28 times faster than the transformer-based baselines.
Researcher Affiliation | Collaboration | Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education; School of Artificial Intelligence, Jilin University, China; Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; Lehigh University, Bethlehem, Pennsylvania, USA
Pseudocode | Yes | Algorithm 1: Decision Mamba-Hybrid. Input: a dataset of trajectories, max iterations M for the training phase, max episodes m for the testing phase, the number of trajectories n in the across-episodic contexts used by the Mamba model, and the number of action steps c per sub-goal. Output: the generated actions. (A structural sketch of this algorithm is given after the table.)
Open Source Code | Yes | Source code and more hyperparameters are described in Appendix B. We provide our code at...
Open Datasets | Yes | Dataset: Grid World. Dataset: Tmaze. Dataset: D4RL [13] is a commonly used offline RL benchmark, including continuous control tasks. (A minimal D4RL loading example is given after the table.)
Dataset Splits | No | The paper mentions "offline training" and "sampling minibatches of trajectories" but does not specify explicit train/validation/test splits by percentages or sample counts for its experiments.
Hardware Specification | Yes | Experiments are carried out on NVIDIA GeForce RTX 3090 GPUs and NVIDIA A10 GPUs. The CPU is an Intel(R) Xeon(R) Gold 6230 @ 2.10 GHz.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | In D4RL, Tmaze, and Large Grid World, the transformer model generates c = 20 steps of actions while the Mamba model generates one sub-goal; in the conventional Grid World, c = 5 because the task is short. Table 3 summarizes the hyperparameters used in the DM-H model. (The c values are restated in the configuration sketch below.)
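
The pseudocode row above can be made concrete. The following is a minimal structural sketch of how Algorithm 1 could be organized, written against the inputs listed in that row (M, m, n, c). It is not the authors' implementation: the model objects, the sample_context / propose_subgoal / act interfaces, and the loss are illustrative placeholders, and the released code referenced in the Open Source Code row is authoritative.

    # Structural sketch of Algorithm 1 (Decision Mamba-Hybrid), not the authors' code.
    # `dataset`, `mamba_model`, and `transformer_model` are assumed to expose the
    # hypothetical interfaces documented below; only the control flow follows the row above.

    def train_dm_h(dataset, mamba_model, transformer_model, optimizer, M, n, c):
        """Offline training phase: M iterations over minibatches of trajectories."""
        for _ in range(M):
            # Across-episodic context: n trajectories concatenated for the Mamba model.
            context = dataset.sample_context(num_trajectories=n)
            # High level: Mamba proposes sub-goals from the long context.
            subgoals = mamba_model(context)
            # Low level: the transformer predicts c steps of actions per sub-goal.
            predicted = transformer_model(context.recent_steps(c), subgoals)
            loss = ((predicted - context.target_actions(c)) ** 2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    def test_dm_h(env, mamba_model, transformer_model, m, c):
        """Online testing phase: m episodes, one new sub-goal every c low-level steps."""
        for _ in range(m):
            obs, history, done = env.reset(), [], False
            while not done:
                subgoal = mamba_model.propose_subgoal(history)  # queried once per c steps
                for _ in range(c):
                    action = transformer_model.act(obs, subgoal)
                    obs, reward, done, info = env.step(action)
                    history.append((obs, action, reward))
                    if done:
                        break

Because the Mamba model is only queried once every c environment steps at test time, most of the online compute falls on the short-context transformer, which is consistent with the reported 28x speed-up over transformer-only baselines on the long-term task.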
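For the D4RL datasets named in the Open Datasets row, the offline trajectories are typically obtained through the d4rl package's Gym interface. The snippet below is a generic loading example, not taken from the paper's code; the task id shown ("hopper-medium-v2") is just one D4RL task, and the exact tasks used are listed in the paper.

    # Generic D4RL loading example (not from the paper's released code).
    import gym
    import d4rl  # registers the D4RL environments with gym on import

    env = gym.make("hopper-medium-v2")   # example task id; the paper lists the tasks it uses
    data = env.get_dataset()             # dict of numpy arrays

    print(data["observations"].shape)    # (N, obs_dim)
    print(data["actions"].shape)         # (N, act_dim)
    print(data["rewards"].shape)         # (N,)
    print(data["terminals"].shape)       # (N,)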
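Finally, the per-benchmark horizon c quoted in the Experiment Setup row can be collected into a small configuration mapping. The dictionary below only restates the values reported in that row; the key names are illustrative, and every other hyperparameter is given in Table 3 of the paper.

    # Low-level horizon c (action steps per Mamba sub-goal), restating the row above.
    # Key names are illustrative; all remaining hyperparameters are in Table 3 of the paper.
    STEPS_PER_SUBGOAL = {
        "d4rl": 20,
        "tmaze": 20,
        "large_grid_world": 20,
        "grid_world": 5,  # conventional Grid World: tasks are short, so a smaller c
    }

    def subgoal_interval(task: str) -> int:
        """Number of actions the transformer generates for each Mamba sub-goal."""
        return STEPS_PER_SUBGOAL[task]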