Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning

Authors: Adam R Villaflor, Zhe Huang, Swapnil Pande, John M Dolan, Jeff Schneider

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our method's superior performance on a variety of autonomous driving tasks in simulation. For all experiments, we compare our SPLT Transformer method to Trajectory Transformer (TT), Decision Transformer (DT), and Behavioral Cloning (BC) with a Transformer model.
Researcher Affiliation | Academia | Adam Villaflor 1, Zhe Huang 1, Swapnil Pande 1, John Dolan 1, Jeff Schneider 1 (1 Carnegie Mellon University). Correspondence to: Adam Villaflor <avillaflor@cmu.edu>, Jeff Schneider <jeff.schneider@cs.cmu.edu>.
Pseudocode | No | The paper describes procedures in text but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/avillaflor/SPLTtransformer
Open Datasets | Yes | Most prior works in offline RL have focused on the mainly deterministic D4RL (Fu et al., 2020) benchmarks and a variety of weakly stochastic Atari (Machado et al., 2018) benchmarks. We evaluate our method on the CARLA (Dosovitskiy et al., 2017) No Crash (Codevilla et al., 2019) benchmark.
Dataset Splits | No | The paper mentions training on Town01 data and evaluating on unseen Town02 routes, but does not provide specific details on how the datasets were split into training, validation, and test sets, nor percentages or sample counts for these splits.
Hardware Specification | No | The paper mentions 'modern GPU hardware' in the context of computational efficiency but does not provide specific details such as GPU models, CPU types, or memory specifications used for experiments.
Software Dependencies | Yes | For these experiments, we run the 0.9.11 version of CARLA at 5fps. ... For these experiments, we run the 0.9.10.1 version of CARLA at 10fps.
Experiment Setup | Yes | For all Transformer-based methods across all experiments, we kept the general Transformer hyperparameters consistent. We used 4 layers of self-attention blocks with 8 heads and an embedding size of 128. ... For our SPLT method, the only additional important hyperparameters are c, n_w, and n_pi for the latent variables, beta for the VAE, and h and k for the planning. We generally did a hyperparameter search over n_w in [2, 4], n_pi in [2, 4], beta in {1e-4, 1e-3, 1e-2}, h in {5, 10}, and k in {2, 5}. For the toy illustrative problem we used c = 2, n_w = 2, n_pi = 3, beta = 1e-3, h = 5, and k = 5. For No Crash, we used c = 2, n_w = 3, n_pi = 2, beta = 0.01, h = 5, and k = 2. For Leaderboard, we used c = 2, n_w = 3, n_pi = 2, beta = 0.01, h = 5, and k = 2.
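For reference, the reported settings can be gathered into a small configuration sketch. This is a minimal, hypothetical layout based only on the hyperparameters quoted in the Experiment Setup row; the key names (e.g. n_layers, embed_dim, n_w, n_pi) are illustrative assumptions and do not necessarily match the released SPLT Transformer codebase.

```python
# Hyperparameters as quoted in the paper's experiment setup.
# Key names are illustrative; they may differ from the official repository.

TRANSFORMER_CONFIG = {
    "n_layers": 4,      # self-attention blocks
    "n_heads": 8,       # attention heads per block
    "embed_dim": 128,   # token embedding size
}

# SPLT-specific settings: latent sizes (c, n_w, n_pi), VAE weight (beta),
# and planning horizon / branching factor (h, k).
SPLT_CONFIGS = {
    "toy_problem": {"c": 2, "n_w": 2, "n_pi": 3, "beta": 1e-3, "h": 5, "k": 5},
    "no_crash":    {"c": 2, "n_w": 3, "n_pi": 2, "beta": 1e-2, "h": 5, "k": 2},
    "leaderboard": {"c": 2, "n_w": 3, "n_pi": 2, "beta": 1e-2, "h": 5, "k": 2},
}

# Reported hyperparameter search space.
SEARCH_SPACE = {
    "n_w":  [2, 4],
    "n_pi": [2, 4],
    "beta": [1e-4, 1e-3, 1e-2],
    "h":    [5, 10],
    "k":    [2, 5],
}
```

This grouping simply mirrors the three evaluation settings named in the quote (toy problem, No Crash, Leaderboard); it is not taken from the repository itself.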