Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets
Authors: Anirudhan Badrinath, Yannis Flet-Berliac, Allen Nie, Emma Brunskill
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results show a significant increase in the final return compared to existing RvS methods, with performance on par with or greater than existing popular temporal difference learning-based methods. Additionally, the performance and stability improvements are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial. |
| Researcher Affiliation | Academia | Anirudhan Badrinath, Yannis Flet-Berliac, Allen Nie, Emma Brunskill; Department of Computer Science, Stanford University; {abadrina, yfletberliac, anie, ebrun}@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1 Training algorithm for transformer-based policy trained on offline dataset D. |
| Open Source Code | No | The paper contains no explicit statement or link providing concrete access to source code for the Waypoint Transformer methodology. |
| Open Datasets | Yes | For this, we leverage D4RL, an open-source benchmark for offline RL, consisting of varying datasets for tasks from Gym-MuJoCo, AntMaze, and Franka Kitchen [Fu et al., 2020]. |
| Dataset Splits | No | The paper refers to a 'held-out dataset' for validation loss (Figure 5) but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions software components like 'GPT-2 architecture' and 'Adam optimizer', but it does not provide specific version numbers for any software dependencies or libraries needed to replicate the experiment. |
| Experiment Setup | Yes | In Table 3, we show the chosen hyperparameter configuration for WT across all experiments. Consistent with the neural network model in RvS-R/G with 1.1M parameters [Emmons et al., 2021], the WT contains 1.1M trainable parameters. For the most part, the chosen hyperparameters align closely with default values in deep learning; for example, we use the ReLU activation function and a learning rate of 0.001 with the Adam optimizer. In Table 4, we show the chosen hyperparameter configuration for the reward and goal waypoint networks across all experiments. |
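
The Open Datasets row quotes the paper's use of D4RL. As an illustration only, the following minimal sketch shows how one of the cited benchmark datasets could be loaded with the standard `gym` and `d4rl` packages; the dataset name and `-v2` version suffix are assumptions, not details taken from the paper.

```python
# Minimal sketch (assumption): loading an AntMaze dataset from D4RL,
# the open-source offline-RL benchmark cited in the paper.
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("antmaze-large-play-v2")  # version suffix is an assumption
dataset = d4rl.qlearning_dataset(env)    # dict of offline transitions

print(dataset["observations"].shape)  # (N, obs_dim)
print(dataset["actions"].shape)       # (N, act_dim)
print(dataset["rewards"].shape)       # (N,)
```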
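
The Experiment Setup row reports a ReLU activation, the Adam optimizer with a learning rate of 0.001, and roughly 1.1M trainable parameters in the WT. The sketch below is a hypothetical PyTorch fragment illustrating only those reported choices; the feed-forward stand-in and its layer sizes are placeholders, not the paper's GPT-2-based architecture.

```python
# Hypothetical PyTorch sketch of the reported training choices
# (ReLU activation, Adam optimizer, learning rate 0.001).
# Layer sizes are illustrative placeholders, not values from the paper.
import torch
import torch.nn as nn

policy = nn.Sequential(      # stand-in for the transformer-based policy
    nn.Linear(29, 256),      # obs_dim -> hidden (placeholder dimensions)
    nn.ReLU(),               # ReLU activation, as reported
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 8),       # hidden -> act_dim (placeholder)
)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)  # Adam, lr 0.001

num_params = sum(p.numel() for p in policy.parameters())
print(f"trainable parameters: {num_params:,}")  # paper reports ~1.1M for WT
```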