Learning from Visual Observation via Offline Pretrained State-to-Go Transformer
Authors: Bohan Zhou, Ke Li, Jiechuan Jiang, Zongqing Lu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on Atari and Minecraft show that our proposed method outperforms baselines and in some tasks even achieves performance comparable to the policy learned from environmental rewards. These results shed light on the potential of utilizing video-only data to solve difficult visual reinforcement learning tasks rather than relying on complete offline datasets containing states, actions, and rewards. |
| Researcher Affiliation | Academia | Bohan Zhou¹, Ke Li², Jiechuan Jiang¹, Zongqing Lu¹,² (¹School of Computer Science, Peking University; ²Beijing Academy of Artificial Intelligence) |
| Pseudocode | Yes | Algorithm 1 in Appendix A details the offline pretraining of the STG Transformer; Algorithm 2 covers Online Reinforcement Learning with Intrinsic Rewards (a minimal sketch of this loop follows the table). |
| Open Source Code | Yes | The project's website and code can be found at https://sites.google.com/view/stgtransformer. |
| Open Datasets | Yes | For Qbert and Space Invaders, we collect the last 10^5 transitions (around 50 trajectories) from Google Dopamine [43] DQN replay experiences. For Breakout and Freeway, we alternatively train a SAC agent [44] from scratch for 5×10^6 steps and leverage the trained policy to gather approximately 50 trajectories (around 10^5 transitions) in each game to construct the expert dataset. Recently, various algorithms, e.g., Plan4MC [16] and CLIP4MC [46], have been proposed for Minecraft tasks. To create expert datasets, for each task, we utilize the learned policies of these two algorithms to collect 100 trajectories (around 5×10^4 observations). (A sketch of this observation-only collection procedure follows the table.) |
| Dataset Splits | No | The paper does not explicitly provide traditional training/validation/test dataset splits with percentages, absolute counts, or references to predefined splits for reproduction. For reinforcement learning experiments, data is often generated through interaction with an environment rather than using a static pre-split dataset for training, validation, and testing in the supervised learning sense. |
| Hardware Specification | Yes | Type of GPUs: A100 or Nvidia RTX 4090 Ti |
| Software Dependencies | No | The paper mentions various optimizers (Adam, RMSprop) and algorithms/frameworks (PPO, SAC, GPT, WGAN, Minedojo, SIL, Dopamine, Plan4MC, CLIP4MC) but does not provide specific version numbers for any software libraries, programming languages, or environments (e.g., Python, PyTorch, TensorFlow, Gym) that would be needed for reproducible setup. |
| Experiment Setup | Yes | Table 3: Hyperparameters for Offline Pretraining, Table 4: General Hyperparameters for PPO, and Table 5: Specific Hyperparameters for Different Tasks explicitly list the hyperparameters and training settings, such as learning rates, batch sizes, optimizer types, discount factors, clip ratios, and coefficients for different loss components (an illustrative configuration skeleton follows the table). |
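
For orientation, below is a minimal, self-contained sketch of the kind of loop described by Algorithm 2 (online reinforcement learning driven by intrinsic rewards). `StgScorer`, `ToyEnv`, and the random action choice are placeholders, not the paper's actual model, environments, or PPO implementation; the only point illustrated is that the environmental reward is discarded and the policy is trained on an STG-derived intrinsic reward.

```python
# Minimal sketch of an Algorithm-2-style loop: the environment reward is ignored
# and an intrinsic reward computed from consecutive observations drives learning.
# StgScorer and ToyEnv are placeholders, not the paper's STG Transformer or tasks.
import numpy as np


class StgScorer:
    """Placeholder scorer for an (o_t, o_{t+1}) pair; higher = more expert-like."""

    def intrinsic_reward(self, obs: np.ndarray, next_obs: np.ndarray) -> float:
        # In the paper this score comes from the offline-pretrained STG Transformer;
        # here it is a dummy similarity measure so the loop runs end to end.
        return float(-np.mean((next_obs - obs) ** 2))


class ToyEnv:
    """Stand-in environment emitting random 84x84 'frames' (no real dynamics)."""

    def reset(self) -> np.ndarray:
        return np.random.rand(84, 84)

    def step(self, action: int):
        next_obs = np.random.rand(84, 84)
        done = bool(np.random.rand() < 0.01)
        return next_obs, 0.0, done  # observation, (ignored) env reward, done flag


env, scorer = ToyEnv(), StgScorer()
obs = env.reset()
for _ in range(1_000):
    action = int(np.random.randint(4))              # placeholder for the PPO policy
    next_obs, env_reward, done = env.step(action)   # env_reward is never used
    r_int = scorer.intrinsic_reward(obs, next_obs)  # reward fed to the RL update
    # A full implementation would store (obs, action, r_int, done) in a PPO buffer
    # and update the policy with the Table 4/5 hyperparameters.
    obs = env.reset() if done else next_obs
```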
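
The expert-data description above (rolling out trained DQN/SAC policies and keeping roughly 50 trajectories per game) can be pictured with the following sketch. The `env` and `policy` arguments are assumed interfaces standing in for the paper's Atari/Minecraft environments and trained agents; actions and rewards are deliberately dropped so that only observations remain, matching the video-only setting.

```python
# Sketch of assembling a video-only expert dataset: roll out a trained policy and
# keep only the observations, discarding actions and rewards. The env/policy
# interfaces are assumed (reset() -> obs, step(a) -> (obs, reward, done)).
import numpy as np


def collect_observation_only_dataset(env, policy, num_trajectories: int = 50):
    """Return a list of trajectories, each an array of observations only."""
    dataset = []
    for _ in range(num_trajectories):
        obs, frames, done = env.reset(), [], False
        while not done:
            frames.append(obs)
            action = policy(obs)                  # trained DQN/SAC policy in the paper
            obs, reward, done = env.step(action)  # reward and action are not stored
        frames.append(obs)                        # keep the terminal observation
        dataset.append(np.stack(frames))
    return dataset
```

With roughly 50 trajectories per Atari game, this would yield on the order of the 10^5 transitions quoted above.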
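
Finally, the hyperparameter tables can be organized as a small configuration object. The field names below follow the categories quoted above (learning rate, batch size, optimizer, discount factor, clip ratio, loss coefficients), while every value is an illustrative placeholder rather than an actual entry from Tables 3-5.

```python
# Illustrative configuration skeleton mirroring the hyperparameter categories in
# Tables 3-5. All values are placeholders, NOT the paper's reported settings.
from dataclasses import dataclass


@dataclass
class PretrainConfig:           # roughly Table 3: offline pretraining of the STG Transformer
    optimizer: str = "adam"     # placeholder
    learning_rate: float = 1e-4
    batch_size: int = 64


@dataclass
class PPOConfig:                # roughly Tables 4-5: online PPO with intrinsic rewards
    learning_rate: float = 3e-4
    discount_factor: float = 0.99
    clip_ratio: float = 0.2
    value_loss_coef: float = 0.5    # task-specific coefficients live in Table 5
    entropy_coef: float = 0.01


config = {"pretrain": PretrainConfig(), "ppo": PPOConfig()}
```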