Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets
Authors: Anirudhan Badrinath, Yannis Flet-Berliac, Allen Nie, Emma Brunskill
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results show a significant increase in the final return compared to existing RvS methods, with performance on par or greater than existing popular temporal difference learning-based methods. Additionally, the performance and stability improvements are largest in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial. |
| Researcher Affiliation | Academia | Anirudhan Badrinath, Yannis Flet-Berliac, Allen Nie, Emma Brunskill; Department of Computer Science, Stanford University; EMAIL |
| Pseudocode | Yes | Algorithm 1 Training algorithm for transformer-based policy trained on offline dataset D. |
| Open Source Code | No | No explicit statement or link providing concrete access to the source code for the Waypoint Transformer methodology described in this paper was found. |
| Open Datasets | Yes | For this, we leverage D4RL, an open-source benchmark for offline RL, consisting of varying datasets for tasks from Gym-MuJoCo, AntMaze, and Franka Kitchen [Fu et al., 2020]. |
| Dataset Splits | No | The paper refers to a 'held-out dataset' for validation loss (Figure 5) but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions software components like 'GPT-2 architecture' and 'Adam optimizer', but it does not provide specific version numbers for any software dependencies or libraries needed to replicate the experiment. |
| Experiment Setup | Yes | In Table 3, we show the chosen hyperparameter configuration for WT across all experiments. Consistent with the neural network model in RvS-R/G with 1.1M parameters [Emmons et al., 2021], the WT contains 1.1M trainable parameters. For the most part, the chosen hyperparameters align closely with default values in deep learning; for example, we use the ReLU activation function and a learning rate of 0.001 with the Adam optimizer. In Table 4, we show the chosen hyperparameter configuration for the reward and goal waypoint networks across all experiments. |