Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets

Authors: Anirudhan Badrinath, Yannis Flet-Berliac, Allen Nie, Emma Brunskill

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results show a significant increase in the final return compared to existing RvS methods, with performance on par with or greater than existing popular temporal-difference learning-based methods. Additionally, the performance and stability improvements are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial.
Researcher Affiliation | Academia | Anirudhan Badrinath, Yannis Flet-Berliac, Allen Nie, Emma Brunskill; Department of Computer Science, Stanford University; {abadrina, yfletberliac, anie, ebrun}@cs.stanford.edu
Pseudocode | Yes | Algorithm 1: Training algorithm for transformer-based policy trained on offline dataset D.
Open Source Code | No | No explicit statement or link providing access to the source code for the Waypoint Transformer method described in this paper was found.
Open Datasets | Yes | "For this, we leverage D4RL, an open-source benchmark for offline RL, consisting of varying datasets for tasks from Gym-MuJoCo, AntMaze, and Franka Kitchen [Fu et al., 2020]." (A minimal data-loading sketch follows the table.)
Dataset Splits | No | The paper refers to a 'held-out dataset' for validation loss (Figure 5) but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed machine specifications) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper mentions software components such as the GPT-2 architecture and the Adam optimizer, but it does not provide specific version numbers for any software dependencies or libraries needed to replicate the experiments.
Experiment Setup | Yes | "In Table 3, we show the chosen hyperparameter configuration for WT across all experiments. Consistent with the neural network model in RvS-R/G with 1.1M parameters [Emmons et al., 2021], the WT contains 1.1M trainable parameters. For the most part, the chosen hyperparameters align closely with default values in deep learning; for example, we use the ReLU activation function and a learning rate of 0.001 with the Adam optimizer. In Table 4, we show the chosen hyperparameter configuration for the reward and goal waypoint networks across all experiments." (A minimal configuration sketch follows the table.)
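For context on the Open Datasets row, below is a minimal sketch of how one of the D4RL datasets named there (e.g., AntMaze Large Play) can be loaded. The specific environment id/version and the use of the `gym` and `d4rl` packages are assumptions of this sketch, not details reported in the paper.

```python
# Minimal sketch, assuming the standard D4RL Python API and an installed
# `d4rl` package; the environment id below is an illustrative choice.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("antmaze-large-play-v2")
dataset = env.get_dataset()  # dict with 'observations', 'actions', 'rewards', 'terminals', ...

print(dataset["observations"].shape, dataset["actions"].shape)
```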
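Similarly, the Experiment Setup row reports ReLU activations and the Adam optimizer with a learning rate of 0.001. The sketch below wires those reported values into a toy supervised (RvS-style) update step; the network widths, input/output dimensions, and loss are illustrative placeholders, not the paper's 1.1M-parameter Waypoint Transformer architecture.

```python
# Minimal sketch, assuming PyTorch; only the ReLU activation, Adam optimizer,
# and learning rate of 0.001 come from the paper's reported setup.
import torch
import torch.nn as nn

policy = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),   # input width is a placeholder (obs + conditioning)
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 8),               # action dimension is a placeholder
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def train_step(obs: torch.Tensor, act: torch.Tensor) -> float:
    """One supervised update: regress dataset actions from conditioned inputs."""
    loss = nn.functional.mse_loss(policy(obs), act)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```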