Decision Transformer: Reinforcement Learning via Sequence Modeling
Authors: Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks. We evaluate on both discrete (Atari [13]) and continuous (OpenAI Gym [14]) control tasks. |
| Researcher Affiliation | Collaboration | Lili Chen∗¹, Kevin Lu∗¹, Aravind Rajeswaran², Kimin Lee¹, Aditya Grover²,³, Michael Laskin¹, Pieter Abbeel¹, Aravind Srinivas†⁴, Igor Mordatch†⁵ (∗ equal contribution, † equal advising). ¹UC Berkeley, ²Facebook AI Research, ³UCLA, ⁴OpenAI, ⁵Google Brain |
| Pseudocode | Yes | Algorithm 1 Decision Transformer Pseudocode (for continuous actions); a hedged training-step sketch based on this algorithm is given after the table. |
| Open Source Code | Yes | Our code is available at: https://sites.google.com/berkeley.edu/decision-transformer |
| Open Datasets | Yes | We evaluate our method on 1% of all samples in the DQN-replay dataset as per Agarwal et al. [16]. In this section, we consider the continuous control tasks from the D4RL benchmark [24]. |
| Dataset Splits | No | The paper mentions evaluating on specific datasets (DQN-replay, D4RL) and sampling minibatches, but it does not explicitly provide the training, validation, and test splits (e.g., percentages or counts) within the main text. It implies use of a dataset for evaluation but does not specify how it's partitioned. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the GPT architecture and implies standard deep learning frameworks (like PyTorch, given the pseudocode), but it does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We feed the last K timesteps into Decision Transformer, for a total of 3K tokens. We use context lengths of K = 30 for Decision Transformer (except K = 50 for Pong); for results with different values of K see the supplementary material. The prediction head corresponding to the input token s_t is trained to predict a_t, either with cross-entropy loss for discrete actions or mean-squared error for continuous actions, and the losses for each timestep are averaged. (A hedged context-construction sketch based on this description is given after the table.) |
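
To make the Pseudocode row concrete: below is a minimal, runnable PyTorch-style sketch of one Decision Transformer training step for continuous actions, in the spirit of the paper's Algorithm 1. The `TinyDecisionTransformer` class, its dimensions, and the random minibatch are illustrative assumptions (a small `nn.TransformerEncoder` stands in for the causal GPT backbone); this is not the authors' released implementation.

```python
# Minimal PyTorch-style sketch of one Decision Transformer training step for
# continuous actions, in the spirit of the paper's Algorithm 1.
# NOTE: TinyDecisionTransformer is an illustrative stand-in; the paper uses a
# causal GPT backbone, and all dimensions/hyperparameters here are assumptions.
import torch
import torch.nn as nn

K = 20            # context length in timesteps (task-specific in the paper)
state_dim = 11    # illustrative state/action dimensions
act_dim = 3
embed_dim = 64

class TinyDecisionTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # one linear embedding per modality, plus a learned timestep embedding
        self.embed_return = nn.Linear(1, embed_dim)
        self.embed_state = nn.Linear(state_dim, embed_dim)
        self.embed_action = nn.Linear(act_dim, embed_dim)
        self.embed_timestep = nn.Embedding(1000, embed_dim)
        # small Transformer encoder as a stand-in for the GPT backbone
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.predict_action = nn.Linear(embed_dim, act_dim)

    def forward(self, returns_to_go, states, actions, timesteps):
        B = states.shape[0]
        t = self.embed_timestep(timesteps)           # (B, K, D)
        r = self.embed_return(returns_to_go) + t
        s = self.embed_state(states) + t
        a = self.embed_action(actions) + t
        # interleave as (R_1, s_1, a_1, ..., R_K, s_K, a_K): 3K tokens in total
        tokens = torch.stack([r, s, a], dim=2).reshape(B, 3 * K, embed_dim)
        # causal mask so each token only attends to earlier tokens
        mask = torch.triu(torch.full((3 * K, 3 * K), float("-inf")), diagonal=1)
        hidden = self.backbone(tokens, mask=mask)
        # actions are predicted from the hidden states of the *state* tokens
        state_hidden = hidden.reshape(B, K, 3, embed_dim)[:, :, 1]
        return self.predict_action(state_hidden)     # (B, K, act_dim)

model = TinyDecisionTransformer()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# one training step on a random minibatch (placeholder for sampled trajectories)
R = torch.randn(8, K, 1)              # returns-to-go
s = torch.randn(8, K, state_dim)      # states
a = torch.randn(8, K, act_dim)        # actions
t = torch.arange(K).repeat(8, 1)      # timesteps
a_pred = model(R, s, a, t)
loss = ((a_pred - a) ** 2).mean()     # MSE for continuous actions, averaged over timesteps
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At evaluation time the paper additionally conditions on a target return and rolls the model out autoregressively, feeding generated actions back in; that loop is omitted from this sketch.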
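The Experiment Setup row states that each context covers K timesteps and therefore 3K tokens (one return-to-go, state, and action token per step). The NumPy sketch below illustrates, under that reading, how a K-step window with returns-to-go might be sliced from a stored trajectory; `returns_to_go`, `sample_context`, and the toy trajectory are hypothetical helpers, not code from the paper.

```python
# Sketch of building a K-step context from a stored trajectory: returns-to-go
# are cumulative future rewards, and each of the K timesteps contributes three
# tokens (return-to-go, state, action), i.e. 3K tokens per sample.
# All names and shapes are illustrative assumptions.
import numpy as np

def returns_to_go(rewards):
    """R_t = sum of rewards from timestep t to the end of the trajectory."""
    return np.cumsum(rewards[::-1])[::-1].copy()

def sample_context(states, actions, rewards, K, start):
    """Slice a K-timestep window; the model later interleaves these into 3K tokens."""
    rtg = returns_to_go(rewards)
    sl = slice(start, start + K)
    return {
        "returns_to_go": rtg[sl],              # (K,)
        "states": states[sl],                  # (K, state_dim)
        "actions": actions[sl],                # (K, act_dim)
        "timesteps": np.arange(start, start + K),
    }

# toy trajectory of length 100
T, state_dim, act_dim = 100, 11, 3
traj_states = np.random.randn(T, state_dim)
traj_actions = np.random.randn(T, act_dim)
traj_rewards = np.random.rand(T)

batch = sample_context(traj_states, traj_actions, traj_rewards, K=20, start=5)
print({k: v.shape for k, v in batch.items()})
```

Per the quoted setup, the training loss is then computed only on the action predictions made from the state tokens, with cross-entropy for discrete actions or mean-squared error for continuous actions, averaged over the K timesteps.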