Decision Transformer: Reinforcement Learning via Sequence Modeling

Authors: Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks." "We evaluate on both discrete (Atari [13]) and continuous (OpenAI Gym [14]) control tasks."
Researcher Affiliation | Collaboration | Lili Chen*1, Kevin Lu*1, Aravind Rajeswaran2, Kimin Lee1, Aditya Grover2,3, Michael Laskin1, Pieter Abbeel1, Aravind Srinivas†4, Igor Mordatch†5 (*equal contribution, †equal advising); 1UC Berkeley, 2Facebook AI Research, 3UCLA, 4OpenAI, 5Google Brain
Pseudocode | Yes | "Algorithm 1 Decision Transformer Pseudocode (for continuous actions)" (see the first sketch after this table)
Open Source Code | Yes | "Our code is available at: https://sites.google.com/berkeley.edu/decision-transformer"
Open Datasets | Yes | "We evaluate our method on 1% of all samples in the DQN-replay dataset as per Agarwal et al. [16]." "In this section, we consider the continuous control tasks from the D4RL benchmark [24]." (see the second sketch after this table)
Dataset Splits | No | The paper mentions evaluating on specific datasets (DQN-replay, D4RL) and sampling minibatches, but it does not explicitly provide the training, validation, and test splits (e.g., percentages or counts) within the main text. It implies use of a dataset for evaluation but does not specify how it is partitioned.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using the GPT architecture and implies standard deep learning frameworks (such as PyTorch, given the pseudocode), but it does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "We feed the last K timesteps into Decision Transformer, for a total of 3K tokens." "We use context lengths of K = 30 for Decision Transformer (except K = 50 for Pong); for results with different values of K see the supplementary material." "The prediction head corresponding to the input token s_t is trained to predict a_t, either with cross-entropy loss for discrete actions or mean-squared error for continuous actions, and the losses for each timestep are averaged." (see the third sketch after this table)
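
First sketch: the Pseudocode row refers to Algorithm 1 of the paper (continuous actions). The PyTorch snippet below is a minimal, illustrative reconstruction of that training step, not the authors' released code. It interleaves return-to-go, state, and action tokens (three tokens per timestep, hence 3K tokens for a context of K steps), applies a causally masked transformer as a stand-in for the GPT backbone the paper uses, and regresses actions from the state-token hidden states with mean-squared error. All layer sizes, names, and the use of nn.TransformerEncoder are assumptions.

```python
import torch
import torch.nn as nn

class DecisionTransformerSketch(nn.Module):
    """Illustrative stand-in for the paper's GPT-based Decision Transformer."""
    def __init__(self, state_dim, act_dim, hidden_dim=128, max_timestep=1000,
                 n_layers=3, n_heads=1):
        super().__init__()
        self.embed_rtg = nn.Linear(1, hidden_dim)           # return-to-go token
        self.embed_state = nn.Linear(state_dim, hidden_dim)
        self.embed_action = nn.Linear(act_dim, hidden_dim)
        self.embed_timestep = nn.Embedding(max_timestep, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(hidden_dim, act_dim)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim), timesteps: (B, K)
        B, K = timesteps.shape
        t_emb = self.embed_timestep(timesteps)
        # interleave (return-to-go, state, action) embeddings: K timesteps -> 3K tokens
        tokens = torch.stack(
            [self.embed_rtg(rtg) + t_emb,
             self.embed_state(states) + t_emb,
             self.embed_action(actions) + t_emb], dim=2).reshape(B, 3 * K, -1)
        causal_mask = torch.triu(torch.full((3 * K, 3 * K), float("-inf")), diagonal=1)
        h = self.transformer(tokens, mask=causal_mask)
        # the hidden state at each state token predicts the action for that timestep
        return self.predict_action(h[:, 1::3])

# one gradient step on dummy data with mean-squared error (continuous actions)
model = DecisionTransformerSketch(state_dim=11, act_dim=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
rtg, states, actions = torch.randn(4, 20, 1), torch.randn(4, 20, 11), torch.randn(4, 20, 3)
timesteps = torch.arange(20).repeat(4, 1)
loss = ((model(rtg, states, actions, timesteps) - actions) ** 2).mean()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```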
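
Second sketch: the Open Datasets row points to the D4RL benchmark. Assuming the publicly released d4rl package (not part of the paper itself), an offline dataset can be loaded and per-trajectory returns-to-go computed roughly as follows; the environment name and the helper function are illustrative, and the full pipeline would additionally split trajectories on the 'terminals' flags.

```python
import gym
import d4rl  # noqa: F401  (importing registers the D4RL offline environments with gym)
import numpy as np

# 'hopper-medium-v2' is just one example task from the D4RL Gym locomotion suite
env = gym.make("hopper-medium-v2")
data = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', 'terminals', ...

def returns_to_go(rewards):
    """Return-to-go for one trajectory: R_t = sum of rewards from step t to the end."""
    return np.cumsum(rewards[::-1])[::-1]

print(data["observations"].shape, data["actions"].shape)
```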
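
Third sketch: the Experiment Setup row quotes the paper's loss choices, cross-entropy for discrete actions (Atari) and mean-squared error for continuous actions (Gym). The snippet below only illustrates those two loss computations on dummy tensors; the batch size, action count, and context lengths are placeholders (K = 30 echoes the Atari setting quoted above).

```python
import torch
import torch.nn.functional as F

# Discrete actions (Atari): the state-token head outputs logits over the action set.
logits = torch.randn(4, 30, 18)                 # (batch, K=30, num_actions)
target_actions = torch.randint(0, 18, (4, 30))  # ground-truth action indices
ce_loss = F.cross_entropy(logits.reshape(-1, 18), target_actions.reshape(-1))

# Continuous actions (Gym): the head outputs action vectors directly.
pred_actions = torch.randn(4, 20, 3)            # (batch, K, act_dim)
true_actions = torch.randn(4, 20, 3)
mse_loss = F.mse_loss(pred_actions, true_actions)

print(ce_loss.item(), mse_loss.item())
```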