Offline Reinforcement Learning as One Big Sequence Modeling Problem

Authors: Michael Janner, Qiyang Li, Sergey Levine

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental evaluation focuses on (1) the accuracy of the Trajectory Transformer as a long-horizon predictor compared to standard dynamics model parameterizations and (2) the utility of sequence modeling tools, namely beam search, as a control algorithm in the context of offline reinforcement learning, imitation learning, and goal-reaching. [...] Results for the locomotion environments are shown in Table 1. [...] Ant Maze results are provided in Table 2.
Researcher Affiliation | Academia | Michael Janner, Qiyang Li, Sergey Levine, University of California at Berkeley; {janner, qcli}@berkeley.edu, svlevine@eecs.berkeley.edu
Pseudocode | Yes | Algorithm 1: Beam search (a sketch follows the table)
Open Source Code | Yes | Code is available at trajectory-transformer.github.io
Open Datasets | Yes | We evaluate the Trajectory Transformer on a number of environments from the D4RL offline benchmark suite (Fu et al., 2020), including the locomotion and Ant Maze domains.
Dataset Splits | No | The paper mentions that 'training is performed', discusses the 'training set' in the context of discretization, and uses standard D4RL benchmarks, but it does not explicitly state the training/validation/test dataset splits (e.g., percentages or counts).
Hardware Specification | No | The paper mentions 'computational resource donations from Microsoft' but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper lists 'NumPy (Harris et al., 2020), PyTorch (Paszke et al., 2019), and minGPT (Karpathy, 2020)' but does not provide explicit version numbers for these libraries (e.g., PyTorch 1.9).
Experiment Setup | Yes | Our model is a Transformer decoder mirroring the GPT architecture (Radford et al., 2018). We use a smaller architecture than those typically used in large-scale language modeling, consisting of four layers and four self-attention heads. [...] We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 2.5 × 10⁻⁴ to train parameters θ.
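
The architecture and optimizer details quoted in the Experiment Setup row translate directly into code. Below is a minimal PyTorch sketch of a GPT-style decoder with the quoted sizes (four layers, four self-attention heads) trained with Adam at 2.5e-4; the embedding width, vocabulary size, and context length are placeholders rather than values reported in the paper, and the built-in `nn.TransformerEncoder` with a causal mask stands in for the minGPT decoder the authors use.

```python
import torch
from torch import nn

# Sizes quoted in the Experiment Setup row: four layers, four attention heads.
N_LAYER, N_HEAD = 4, 4
N_EMBD = 128       # placeholder embedding width (not reported in this excerpt)
VOCAB_SIZE = 100   # placeholder: number of discretization bins
BLOCK_SIZE = 249   # placeholder context length in tokens

class TinyGPT(nn.Module):
    """GPT-style decoder stand-in built from PyTorch primitives."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)
        self.pos_emb = nn.Parameter(torch.zeros(1, BLOCK_SIZE, N_EMBD))
        layer = nn.TransformerEncoderLayer(
            d_model=N_EMBD, nhead=N_HEAD, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.head = nn.Linear(N_EMBD, VOCAB_SIZE)

    def forward(self, tokens):
        x = self.tok_emb(tokens) + self.pos_emb[:, : tokens.size(1)]
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(x, mask=mask)
        return self.head(x).log_softmax(dim=-1)

model = TinyGPT()
# Adam with the learning rate quoted in the paper (2.5e-4).
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
```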
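
The Pseudocode row refers to the paper's Algorithm 1, beam search repurposed as a planner. The sketch below shows vanilla beam search over a discretized token sequence, assuming a hypothetical `model(tokens)` that returns per-token log-probabilities (the `TinyGPT` above satisfies this interface); the paper's variant has the same structure but ranks candidates by predicted cumulative reward rather than log-probability alone.

```python
import torch

@torch.no_grad()
def beam_search(model, prefix, horizon, beam_width, vocab_size):
    """Vanilla beam search over discrete tokens.

    Assumes `model(tokens)` returns log-probabilities of shape
    (n_beams, seq_len, vocab_size) and `prefix` is a 1-D LongTensor of
    conditioning tokens, with beam_width <= vocab_size.
    """
    beams = prefix.unsqueeze(0)                  # (1, prefix_len)
    scores = torch.zeros(1)                      # cumulative log-prob per beam

    for _ in range(horizon):
        logp = model(beams)[:, -1, :]            # next-token log-probs
        cand = scores.unsqueeze(1) + logp        # (n_beams, vocab_size)
        top = cand.flatten().topk(beam_width)    # best (beam, token) pairs
        beam_idx = torch.div(top.indices, vocab_size, rounding_mode="floor")
        token_idx = (top.indices % vocab_size).unsqueeze(1)
        beams = torch.cat([beams[beam_idx], token_idx], dim=1)
        scores = top.values

    return beams[scores.argmax()]                # highest-scoring sequence
```

In the paper, the decoded tokens cover discretized states, actions, and rewards, and the beam width and planning horizon are hyperparameters of the resulting controller.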