Do Transformer World Models Give Better Policy Gradients?

Authors: Michel Ma, Tianwei Ni, Clement Gehring, Pierluca D’Oro, Pierre-Luc Bacon

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that AWMs can generate optimization landscapes that are easier to navigate even when compared to those from the simulator itself. This property allows transformer AWMs to produce better policies than competitive baselines in realistic long-horizon tasks. Finally, through a series of experiments, we showcase the remarkable empirical properties of backpropagation-based policy optimization with AWMs.
Researcher Affiliation | Academia | Mila, Montreal, Canada; Department of Computer Science, University of Montreal, Canada.
Pseudocode | Yes | Algorithm 1: Backpropagation-based Policy Optimization (BPO). A minimal sketch of this idea appears after the table.
Open Source Code | Yes | Code is available at https://github.com/micklethepickle/actionconditioned-world-models.
Open Datasets | Yes | We focus on eight tasks from the Myriad testbed (Howe et al., 2022).
Dataset Splits | No | No explicit statement specifies train/validation/test split percentages, sample counts, or citations of specific predefined splits in this paper.
Hardware Specification | No | No specific hardware models (GPU, CPU, TPU) or detailed machine specifications are given for the experiments, only a general acknowledgment of "computational resources".
Software Dependencies | No | The self-attention transformers use a similar architecture to the GPT-2 model (Radford et al., 2019) implemented by the Hugging Face Transformer library (Wolf et al., 2019). No specific version numbers for software components are provided; a sketch of such an instantiation follows the table.
Experiment Setup | Yes | "Appendix B.1 for hyperparameters of all experiments"; "Table 1. Hyper-parameters for all BPO algorithms that do not pertain to the world model hyper-parameters."; "Table 2. Hyper-parameters for all model-free results on the Myriad environments."
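
The BPO pseudocode noted above (Algorithm 1, Backpropagation-based Policy Optimization) amounts to differentiating a model-based return through an unrolled, differentiable world model. The following is a minimal Python/JAX sketch of that idea; the toy linear policy, dynamics, and reward are illustrative placeholders and not the paper's implementation or hyperparameters.

    import jax
    import jax.numpy as jnp

    def policy(theta, s):
        # Toy deterministic linear policy a = theta @ s (placeholder, not the paper's policy)
        return theta @ s

    def world_model(phi, s, a):
        # Toy differentiable dynamics s' = A s + B a, standing in for the learned AWM
        A, B = phi
        return A @ s + B @ a

    def reward(s, a):
        # Toy quadratic cost, negated to act as a reward
        return -(jnp.sum(s ** 2) + 0.1 * jnp.sum(a ** 2))

    def unrolled_return(theta, phi, s0, horizon=10):
        # Unroll the model under the current policy and sum rewards,
        # keeping the whole computation differentiable end to end.
        def step(s, _):
            a = policy(theta, s)
            return world_model(phi, s, a), reward(s, a)
        _, rewards = jax.lax.scan(step, s0, None, length=horizon)
        return jnp.sum(rewards)

    # Policy gradient obtained by backpropagating the return through the unroll.
    key = jax.random.PRNGKey(0)
    dim_s, dim_a = 4, 2
    theta = 0.1 * jax.random.normal(key, (dim_a, dim_s))
    phi = (jnp.eye(dim_s), 0.1 * jax.random.normal(key, (dim_s, dim_a)))
    grad_theta = jax.grad(unrolled_return)(theta, phi, jnp.ones(dim_s))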
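
On software dependencies, the paper states only that its transformers follow a GPT-2-style architecture implemented with the Hugging Face Transformers library, without pinning versions. A hypothetical sketch of instantiating such a backbone is shown below; the hyperparameter values are placeholders, not the paper's settings.

    from transformers import GPT2Config, GPT2Model

    config = GPT2Config(
        n_embd=128,       # embedding width (placeholder)
        n_layer=4,        # number of transformer blocks (placeholder)
        n_head=4,         # attention heads (placeholder)
        n_positions=256,  # maximum sequence length (placeholder)
    )
    model = GPT2Model(config)  # decoder-only backbone with causal self-attention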