Offline Reinforcement Learning as One Big Sequence Modeling Problem
Authors: Michael Janner, Qiyang Li, Sergey Levine
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation focuses on (1) the accuracy of the Trajectory Transformer as a long-horizon predictor compared to standard dynamics model parameterizations and (2) the utility of sequence modeling tools, namely beam search, as a control algorithm in the context of offline reinforcement learning, imitation learning, and goal-reaching. [...] Results for the locomotion environments are shown in Table 1. [...] Ant Maze results are provided in Table 2. |
| Researcher Affiliation | Academia | Michael Janner, Qiyang Li, Sergey Levine; University of California at Berkeley; {janner, qcli}@berkeley.edu, svlevine@eecs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Beam search (see the beam-search sketch below the table) |
| Open Source Code | Yes | Code is available at trajectory-transformer.github.io |
| Open Datasets | Yes | We evaluate the Trajectory Transformer on a number of environments from the D4RL offline benchmark suite (Fu et al., 2020), including the locomotion and Ant Maze domains. (See the D4RL loading sketch below the table.) |
| Dataset Splits | No | The paper mentions that 'training is performed' and refers to a 'training set' in the context of discretization, and it uses standard D4RL benchmarks, but it does not explicitly state specific training/validation/test dataset splits (e.g., percentages or counts). |
| Hardware Specification | No | The paper mentions 'computational resource donations from Microsoft' but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper lists 'NumPy (Harris et al., 2020), PyTorch (Paszke et al., 2019), and minGPT (Karpathy, 2020)' but does not provide explicit version numbers for these libraries (e.g., PyTorch 1.9). |
| Experiment Setup | Yes | Our model is a Transformer decoder mirroring the GPT architecture (Radford et al., 2018). We use a smaller architecture than those typically used in large-scale language modeling, consisting of four layers and four self-attention heads. [...] We use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 2.5 × 10⁻⁴ to train parameters θ. (See the model and optimizer sketch below the table.) |
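The paper's Algorithm 1 is beam search over discretized trajectory tokens. As a reference point, here is a minimal sketch of vanilla beam search over an autoregressive token model in PyTorch. The `model`, `prefix`, and `beam_width` names are illustrative, and this sketch scores candidates by cumulative log-probability only; the paper's planning variant additionally ranks candidates by predicted reward, which is not implemented here.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def beam_search(model, prefix, n_steps, beam_width):
    """Minimal beam search over an autoregressive token model.

    model:  maps a (batch, seq_len) token tensor to (batch, seq_len, vocab) logits
    prefix: (1, prefix_len) tensor of conditioning tokens
    Returns the highest-scoring sequence of prefix_len + n_steps tokens.
    Assumes vocab size >= beam_width.
    """
    # Each beam is a (sequence, cumulative log-probability) pair.
    sequences = prefix.repeat(beam_width, 1)   # (beam, prefix_len)
    scores = torch.zeros(beam_width)
    scores[1:] = -float("inf")                 # start from a single live beam

    for _ in range(n_steps):
        logits = model(sequences)[:, -1, :]            # next-token logits per beam
        log_probs = F.log_softmax(logits, dim=-1)      # (beam, vocab)
        # Expand every beam by every token, then keep the top-k candidates overall.
        candidate_scores = scores[:, None] + log_probs
        top_scores, top_idx = candidate_scores.view(-1).topk(beam_width)
        vocab = log_probs.size(-1)
        beam_idx = torch.div(top_idx, vocab, rounding_mode="floor")  # source beam
        token_idx = top_idx % vocab                                  # extending token
        sequences = torch.cat([sequences[beam_idx], token_idx[:, None]], dim=1)
        scores = top_scores

    return sequences[scores.argmax()]
```

For example, `beam_search(model, prefix, n_steps=15, beam_width=32)` would decode 15 tokens past the conditioning prefix while keeping the 32 highest-scoring partial sequences at each step.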
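Because the experiments use the D4RL benchmark, the datasets can be pulled through D4RL's standard interface. A minimal sketch, assuming the `d4rl` package is installed; `halfcheetah-medium-v2` is an illustrative locomotion task name, since the excerpt does not specify dataset versions.

```python
import gym
import d4rl  # noqa: F401 -- importing registers the D4RL environments with gym

# Illustrative task name; any D4RL locomotion or Ant Maze task loads the same way.
env = gym.make("halfcheetah-medium-v2")
dataset = env.get_dataset()  # dict of numpy arrays keyed by field name

print(dataset["observations"].shape)  # (N, obs_dim)
print(dataset["actions"].shape)       # (N, act_dim)
print(dataset["rewards"].shape)       # (N,)
```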
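The quoted setup (a GPT-style Transformer decoder with four layers and four self-attention heads, trained with Adam at a learning rate of 2.5 × 10⁻⁴) can be sketched in plain PyTorch as below. This is an approximation, not the authors' minGPT-based implementation; the embedding width, vocabulary size, and context length are placeholder values the excerpt does not state.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the paper; N_EMBD, VOCAB_SIZE, and BLOCK_SIZE
# are placeholders not given in the excerpt.
N_LAYER, N_HEAD = 4, 4
N_EMBD, VOCAB_SIZE, BLOCK_SIZE = 128, 100, 256


class TinyGPT(nn.Module):
    """GPT-style causal Transformer, sketched with PyTorch's built-in
    encoder layer plus a causal mask (not the authors' minGPT code)."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)
        self.pos_emb = nn.Parameter(torch.zeros(1, BLOCK_SIZE, N_EMBD))
        layer = nn.TransformerEncoderLayer(
            d_model=N_EMBD, nhead=N_HEAD, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.head = nn.Linear(N_EMBD, VOCAB_SIZE)

    def forward(self, idx):
        t = idx.size(1)
        x = self.tok_emb(idx) + self.pos_emb[:, :t]
        # Causal mask so each position attends only to earlier positions.
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        x = self.blocks(x, mask=causal)
        return self.head(x)


model = TinyGPT()
# Learning rate quoted in the paper: Adam at 2.5e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
```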