Do Transformer World Models Give Better Policy Gradients?

Authors: Michel Ma, Tianwei Ni, Clement Gehring, Pierluca D’Oro, Pierre-Luc Bacon

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that AWMs can generate optimization landscapes that are easier to navigate even when compared to those from the simulator itself. This property allows transformer AWMs to produce better policies than competitive baselines in realistic long-horizon tasks. Finally, through a series of experiments, we showcase the remarkable empirical properties of backpropagation-based policy optimization with AWMs.
Researcher Affiliation | Academia | Mila, Montreal, Canada; Department of Computer Science, University of Montreal, Canada.
Pseudocode | Yes | Algorithm 1: Backpropagation-based Policy Optimization (BPO). A minimal sketch of this idea appears after the table.
Open Source Code | Yes | Code is available at https://github.com/micklethepickle/actionconditioned-world-models.
Open Datasets | Yes | We focus on eight tasks from the Myriad testbed (Howe et al., 2022).
Dataset Splits | No | No explicit statement specifies train/validation/test split percentages, sample counts, or citations of specific predefined splits in this paper.
Hardware Specification | No | No specific hardware models (GPU, CPU, TPU) or detailed machine specifications are given for the experiments, only a general acknowledgment of "computational resources".
Software Dependencies | No | The self-attention transformers use a similar architecture to the GPT-2 model (Radford et al., 2019) implemented by the Hugging Face Transformer library (Wolf et al., 2019). No specific version numbers for software components are provided; a sketch of such an instantiation follows the table.
Experiment Setup | Yes | "Appendix B.1 for hyperparameters of all experiments"; "Table 1. Hyper-parameters for all BPO algorithms that do not pertain to the world model hyper-parameters."; "Table 2. Hyper-parameters for all model-free results on the Myriad environments."
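
The BPO pseudocode noted above (Algorithm 1, Backpropagation-based Policy Optimization) amounts to differentiating a model-based return through an unrolled, differentiable world model. The following is a minimal Python/JAX sketch of that idea; the toy linear policy, dynamics, and reward are illustrative placeholders and not the paper's implementation or hyperparameters.

    import jax
    import jax.numpy as jnp

    def policy(theta, s):
        # Toy deterministic linear policy a = theta @ s (placeholder, not the paper's policy)
        return theta @ s

    def world_model(phi, s, a):
        # Toy differentiable dynamics s' = A s + B a, standing in for the learned AWM
        A, B = phi
        return A @ s + B @ a

    def reward(s, a):
        # Toy quadratic cost, negated to act as a reward
        return -(jnp.sum(s ** 2) + 0.1 * jnp.sum(a ** 2))

    def unrolled_return(theta, phi, s0, horizon=10):
        # Unroll the model under the current policy and sum rewards,
        # keeping the whole computation differentiable end to end.
        def step(s, _):
            a = policy(theta, s)
            return world_model(phi, s, a), reward(s, a)
        _, rewards = jax.lax.scan(step, s0, None, length=horizon)
        return jnp.sum(rewards)

    # Policy gradient obtained by backpropagating the return through the unroll.
    key = jax.random.PRNGKey(0)
    dim_s, dim_a = 4, 2
    theta = 0.1 * jax.random.normal(key, (dim_a, dim_s))
    phi = (jnp.eye(dim_s), 0.1 * jax.random.normal(key, (dim_s, dim_a)))
    grad_theta = jax.grad(unrolled_return)(theta, phi, jnp.ones(dim_s))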
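
On software dependencies, the paper states only that its transformers follow a GPT-2-style architecture implemented with the Hugging Face Transformers library, without pinning versions. A hypothetical sketch of instantiating such a backbone is shown below; the hyperparameter values are placeholders, not the paper's settings.

    from transformers import GPT2Config, GPT2Model

    config = GPT2Config(
        n_embd=128,       # embedding width (placeholder)
        n_layer=4,        # number of transformer blocks (placeholder)
        n_head=4,         # attention heads (placeholder)
        n_positions=256,  # maximum sequence length (placeholder)
    )
    model = GPT2Model(config)  # decoder-only backbone with causal self-attention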