Learning from Visual Observation via Offline Pretrained State-to-Go Transformer

Authors: Bohan Zhou, Ke Li, Jiechuan Jiang, Zongqing Lu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on Atari and Minecraft show that our proposed method outperforms baselines and in some tasks even achieves performance comparable to the policy learned from environmental rewards. These results shed light on the potential of utilizing video-only data to solve difficult visual reinforcement learning tasks rather than relying on complete offline datasets containing states, actions, and rewards.
Researcher Affiliation | Academia | Bohan Zhou (1), Ke Li (2), Jiechuan Jiang (1), Zongqing Lu (1,2); (1) School of Computer Science, Peking University; (2) Beijing Academy of Artificial Intelligence
Pseudocode | Yes | Algorithm 1 in Appendix A details the offline pretraining of the STG Transformer; Algorithm 2 details online reinforcement learning with intrinsic rewards. (A hedged sketch of the online phase follows the table.)
Open Source Code | Yes | The project's website and code can be found at https://sites.google.com/view/stgtransformer.
Open Datasets | Yes | For Qbert and Space Invaders, we collect the last 10^5 transitions (around 50 trajectories) from Google Dopamine [43] DQN replay experiences. For Breakout and Freeway, we instead train a SAC agent [44] from scratch for 5 x 10^6 steps and use the trained policy to gather approximately 50 trajectories (around 10^5 transitions) in each game to construct the expert dataset. Recently, various algorithms, e.g., Plan4MC [16] and CLIP4MC [46], have been proposed for Minecraft tasks. To create expert datasets, for each task we use the learned policies of these two algorithms to collect 100 trajectories (around 5 x 10^4 observations). (A hedged collection sketch follows the table.)
Dataset Splits | No | The paper does not explicitly provide traditional training/validation/test splits with percentages, absolute counts, or references to predefined splits. In these reinforcement learning experiments, data is generated through interaction with the environment rather than drawn from a static, pre-split dataset in the supervised-learning sense.
Hardware Specification | Yes | Type of GPUs: A100 or Nvidia RTX 4090 Ti.
Software Dependencies | No | The paper mentions optimizers (Adam, RMSprop) and algorithms/frameworks (PPO, SAC, GPT, WGAN, MineDojo, SIL, Dopamine, Plan4MC, CLIP4MC) but does not provide version numbers for any software libraries, programming languages, or environments (e.g., Python, PyTorch, TensorFlow, Gym) that would be needed for a reproducible setup.
Experiment Setup | Yes | Table 3 (hyperparameters for offline pretraining), Table 4 (general hyperparameters for PPO), and Table 5 (task-specific hyperparameters) explicitly list training settings such as learning rates, batch sizes, optimizer types, discount factors, clip ratios, and coefficients for the different loss components.
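
The Pseudocode row above references Algorithm 2 (online reinforcement learning with intrinsic rewards). Below is a minimal, hypothetical sketch of that phase, assuming the offline-pretrained STG model is frozen and its discriminator score on each observed transition serves as the intrinsic reward fed to PPO. The class and method names (stg.encode, stg.discriminate, agent.ppo_update) are placeholders rather than the authors' released API, and the exact reward definition in the paper may differ.

```python
# Hypothetical sketch of Algorithm 2: online RL driven purely by intrinsic
# rewards from a frozen, offline-pretrained STG model. All interfaces here
# (stg.encode, stg.discriminate, agent.act, agent.ppo_update, Gym-style env)
# are placeholders, not the authors' released code.
import torch


def intrinsic_reward(stg, obs, next_obs):
    """Score one observed transition with the frozen STG discriminator.

    Assumption: a higher discriminator output means the transition looks
    more expert-like, so the raw score is used as the intrinsic reward.
    """
    with torch.no_grad():
        z, z_next = stg.encode(obs), stg.encode(next_obs)
        return stg.discriminate(z, z_next).item()


def train_online(env, agent, stg, total_steps=1_000_000, rollout_len=2048):
    """PPO training loop that ignores environment rewards entirely."""
    obs = env.reset()
    rollout = []
    for _ in range(total_steps):
        action = agent.act(obs)
        next_obs, _, done, _ = env.step(action)  # env reward is discarded
        r_int = intrinsic_reward(stg, obs, next_obs)
        rollout.append((obs, action, r_int, next_obs, done))
        obs = env.reset() if done else next_obs
        if len(rollout) == rollout_len:  # on-policy PPO update on intrinsic rewards only
            agent.ppo_update(rollout)
            rollout.clear()
```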
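The Open Datasets row describes building video-only expert datasets by rolling out trained policies (e.g., a SAC agent for Breakout and Freeway) and keeping roughly 50 trajectories per game. The snippet below is a hedged illustration of such observation-only collection in a classic Gym-style Atari environment; load_trained_sac_policy and the environment id are assumptions, and the authors' actual collection scripts may differ.

```python
# Hypothetical observation-only data collection from a trained policy.
# `load_trained_sac_policy` is a placeholder loader, and the classic Gym
# step/reset API (pre-0.26) is assumed; the paper's pipeline may differ.
import gym
import numpy as np


def collect_video_only_trajectories(env_id="BreakoutNoFrameskip-v4",
                                    num_trajectories=50):
    env = gym.make(env_id)
    policy = load_trained_sac_policy(env_id)  # placeholder: maps obs -> action
    dataset = []
    for _ in range(num_trajectories):
        obs, done, frames = env.reset(), False, []
        while not done:
            frames.append(obs)                 # store observations only;
            action = policy(obs)               # actions and rewards are dropped
            obs, _, done, _ = env.step(action)
        frames.append(obs)
        dataset.append(np.stack(frames))
    return dataset
```

Only the stacked observation sequences are kept, matching the video-only (no actions, no rewards) setting described in the excerpt.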