Do Transformer World Models Give Better Policy Gradients?
Authors: Michel Ma, Tianwei Ni, Clement Gehring, Pierluca D’Oro, Pierre-Luc Bacon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that AWMs can generate optimization landscapes that are easier to navigate even when compared to those from the simulator itself. This property allows transformer AWMs to produce better policies than competitive baselines in realistic long-horizon tasks. Finally, through a series of experiments, we showcase the remarkable empirical properties of backpropagation-based policy optimization with AWMs. |
| Researcher Affiliation | Academia | (1) Mila, Montreal, Canada; (2) Department of Computer Science, University of Montreal, Canada. |
| Pseudocode | Yes | Algorithm 1 Backpropagation-based Policy Optimization (BPO) |
| Open Source Code | Yes | Code is available at https://github.com/micklethepickle/actionconditioned-world-models. |
| Open Datasets | Yes | We focus on eight tasks from the Myriad testbed (Howe et al., 2022). |
| Dataset Splits | No | The paper does not state train/validation/test split percentages or sample counts, and it cites no predefined splits. |
| Hardware Specification | No | No specific hardware models (GPU, CPU, TPU) or detailed computer specifications are provided for running experiments, only a general acknowledgment of 'computational resources'. |
| Software Dependencies | No | The self-attention transformers use a similar architecture to the GPT-2 model (Radford et al., 2019) implemented by the Hugging Face Transformer library (Wolf et al., 2019). No specific version numbers for software components are provided. |
| Experiment Setup | Yes | Hyperparameters for all experiments are given in Appendix B.1. "Table 1. Hyper-parameters for all BPO algorithms that do not pertain to the world model hyper-parameters." "Table 2. Hyper-parameters for all model-free results on the Myriad environments." |