Learning from Visual Observation via Offline Pretrained State-to-Go Transformer

Authors: Bohan Zhou, Ke Li, Jiechuan Jiang, Zongqing Lu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on Atari and Minecraft show that our proposed method outperforms baselines and in some tasks even achieves performance comparable to the policy learned from environmental rewards. These results shed light on the potential of utilizing video-only data to solve difficult visual reinforcement learning tasks rather than relying on complete offline datasets containing states, actions, and rewards.
Researcher Affiliation | Academia | Bohan Zhou (1), Ke Li (2), Jiechuan Jiang (1), Zongqing Lu (1,2); (1) School of Computer Science, Peking University; (2) Beijing Academy of Artificial Intelligence
Pseudocode | Yes | Algorithm 1 in Appendix A details the offline pretraining of the STG Transformer; Algorithm 2 details online reinforcement learning with intrinsic rewards. (A hedged sketch of the online phase follows the table.)
Open Source Code | Yes | The project's website and code can be found at https://sites.google.com/view/stgtransformer.
Open Datasets | Yes | For Qbert and Space Invaders, we collect the last 10^5 transitions (around 50 trajectories) from Google Dopamine [43] DQN replay experiences. For Breakout and Freeway, we instead train a SAC agent [44] from scratch for 5 x 10^6 steps and use the trained policy to gather approximately 50 trajectories (around 10^5 transitions) in each game to construct the expert dataset. Recently, various algorithms, e.g., Plan4MC [16] and CLIP4MC [46], have been proposed for Minecraft tasks. To create expert datasets, for each task we use the learned policies of these two algorithms to collect 100 trajectories (around 5 x 10^4 observations). (A hedged collection sketch follows the table.)
Dataset Splits | No | The paper does not explicitly provide traditional training/validation/test splits with percentages, absolute counts, or references to predefined splits. In these reinforcement learning experiments, data is generated through interaction with the environment rather than drawn from a static, pre-split dataset in the supervised-learning sense.
Hardware Specification | Yes | Type of GPUs: A100 or Nvidia RTX 4090 Ti.
Software Dependencies | No | The paper mentions optimizers (Adam, RMSprop) and algorithms/frameworks (PPO, SAC, GPT, WGAN, MineDojo, SIL, Dopamine, Plan4MC, CLIP4MC) but does not provide version numbers for any software libraries, programming languages, or environments (e.g., Python, PyTorch, TensorFlow, Gym) that would be needed for a reproducible setup.
Experiment Setup | Yes | Table 3 (hyperparameters for offline pretraining), Table 4 (general hyperparameters for PPO), and Table 5 (task-specific hyperparameters) explicitly list training settings such as learning rates, batch sizes, optimizer types, discount factors, clip ratios, and coefficients for the different loss components.
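
The Pseudocode row above references Algorithm 2 (online reinforcement learning with intrinsic rewards). Below is a minimal, hypothetical sketch of that phase, assuming the offline-pretrained STG model is frozen and its discriminator score on each observed transition serves as the intrinsic reward fed to PPO. The class and method names (stg.encode, stg.discriminate, agent.ppo_update) are placeholders rather than the authors' released API, and the exact reward definition in the paper may differ.

```python
# Hypothetical sketch of Algorithm 2: online RL driven purely by intrinsic
# rewards from a frozen, offline-pretrained STG model. All interfaces here
# (stg.encode, stg.discriminate, agent.act, agent.ppo_update, Gym-style env)
# are placeholders, not the authors' released code.
import torch


def intrinsic_reward(stg, obs, next_obs):
    """Score one observed transition with the frozen STG discriminator.

    Assumption: a higher discriminator output means the transition looks
    more expert-like, so the raw score is used as the intrinsic reward.
    """
    with torch.no_grad():
        z, z_next = stg.encode(obs), stg.encode(next_obs)
        return stg.discriminate(z, z_next).item()


def train_online(env, agent, stg, total_steps=1_000_000, rollout_len=2048):
    """PPO training loop that ignores environment rewards entirely."""
    obs = env.reset()
    rollout = []
    for _ in range(total_steps):
        action = agent.act(obs)
        next_obs, _, done, _ = env.step(action)  # env reward is discarded
        r_int = intrinsic_reward(stg, obs, next_obs)
        rollout.append((obs, action, r_int, next_obs, done))
        obs = env.reset() if done else next_obs
        if len(rollout) == rollout_len:  # on-policy PPO update on intrinsic rewards only
            agent.ppo_update(rollout)
            rollout.clear()
```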
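The Open Datasets row describes building video-only expert datasets by rolling out trained policies (e.g., a SAC agent for Breakout and Freeway) and keeping roughly 50 trajectories per game. The snippet below is a hedged illustration of such observation-only collection in a classic Gym-style Atari environment; load_trained_sac_policy and the environment id are assumptions, and the authors' actual collection scripts may differ.

```python
# Hypothetical observation-only data collection from a trained policy.
# `load_trained_sac_policy` is a placeholder loader, and the classic Gym
# step/reset API (pre-0.26) is assumed; the paper's pipeline may differ.
import gym
import numpy as np


def collect_video_only_trajectories(env_id="BreakoutNoFrameskip-v4",
                                    num_trajectories=50):
    env = gym.make(env_id)
    policy = load_trained_sac_policy(env_id)  # placeholder: maps obs -> action
    dataset = []
    for _ in range(num_trajectories):
        obs, done, frames = env.reset(), False, []
        while not done:
            frames.append(obs)                 # store observations only;
            action = policy(obs)               # actions and rewards are dropped
            obs, _, done, _ = env.step(action)
        frames.append(obs)
        dataset.append(np.stack(frames))
    return dataset
```

Only the stacked observation sequences are kept, matching the video-only (no actions, no rewards) setting described in the excerpt.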