Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning?
Authors: Yang Dai, Oubo Ma, Longfei Zhang, Xingxing Liang, Shengchao Hu, Mengzhu Wang, Shouling Ji, Jincai Huang, Li Shen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | this work aims to conduct comprehensive experiments to explore the potential of Decision Mamba (dubbed DeMa) in offline RL from the aspect of data structures and essential components with the following insights: (1) Long sequences impose a significant computational burden without contributing to performance improvements since DeMa's focus on sequences diminishes approximately exponentially. Consequently, we introduce a Transformer-like DeMa as opposed to an RNN-like DeMa. (2) For the components of DeMa, we identify the hidden attention mechanism as a critical factor in its success, which can also work well with other residual structures and does not require position embedding. Extensive evaluations demonstrate that our specially designed DeMa is compatible with trajectory optimization and surpasses previous methods, outperforming Decision Transformer (DT) with higher performance while using 30% fewer parameters in Atari, and exceeding DT with only a quarter of the parameters in MuJoCo. |
| Researcher Affiliation | Academia | (1) Laboratory for Big Data and Decision, National University of Defense Technology; (2) Zhejiang University; (3) Shanghai Jiao Tong University; (4) Hebei University of Technology; (5) Shenzhen Campus of Sun Yat-sen University |
| Pseudocode | No | The paper describes the procedures of training and inference, but does not present them in a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/AndssY/DeMa. |
| Open Datasets | Yes | We conduct experiments in eight different games: Breakout, Qbert, Pong, Seaquest, Asterix, Frostbite, Assault, and Gopher. We use the 1% DQN Replay Dataset [64] as our training dataset, which encompasses a total of 500,000 timesteps worth of samples generated throughout the training process of a DQN agent [65]. It's worth noting that the versions of "atari-py" and "gym" we use are 0.2.5 and 0.19.0 respectively, as noted by the official code at https://github.com/google-research/batch_rl. |
| Dataset Splits | No | Given a dataset of offline trajectories, we randomly select a starting point and truncate it into a sequence of length K. (An illustrative sketch of this sampling step is given below the table.) |
| Hardware Specification | Yes | We use one NVIDIA GeForce RTX 4090 to train each model in MuJoCo and one NVIDIA GeForce RTX 3090 to train each model in Atari. |
| Software Dependencies | Yes | It's worth noting that the versions of "atari-py" and "gym" we use are 0.2.5 and 0.19.0 respectively, as noted by the official code at https://github.com/google-research/batch_rl. |
| Experiment Setup | Yes | Tables 8-10 provide a comprehensive list of hyper-parameters for our proposed Transformer-like DeMa and RNN-like DeMa applied to MuJoCo and Atari environments. To ensure a fair comparison, we adopt similar hyper-parameter settings to DT [12] and DC [13]. |
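
The Dataset Splits row only describes sub-trajectory sampling informally (pick a random starting point, truncate to a sequence of length K). The snippet below is a minimal sketch of that step, not the authors' implementation: the function name `sample_subsequence`, the trajectory keys, and the zero-padding scheme are illustrative assumptions.

```python
import numpy as np

def sample_subsequence(trajectory, K, rng=None):
    """Pick a random starting index in one offline trajectory and truncate
    it to a sub-sequence of length K, zero-padding if the tail is short.
    NOTE: illustrative sketch; keys and padding scheme are assumptions."""
    if rng is None:
        rng = np.random.default_rng()
    T = len(trajectory["observations"])
    start = int(rng.integers(0, T))          # random starting point
    end = min(start + K, T)
    sub = {k: np.asarray(v)[start:end] for k, v in trajectory.items()}
    pad = K - (end - start)                  # amount of padding needed
    if pad > 0:
        sub = {k: np.concatenate([v, np.zeros((pad,) + v.shape[1:], v.dtype)])
               for k, v in sub.items()}
    return sub

# Toy usage: a 100-step trajectory with context length K = 20.
traj = {
    "observations": np.random.randn(100, 17),   # e.g. MuJoCo state vectors
    "actions": np.random.randn(100, 6),
    "returns_to_go": np.random.randn(100, 1),
}
batch_item = sample_subsequence(traj, K=20)
print({k: v.shape for k, v in batch_item.items()})   # each array is (20, dim)
```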