Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Authors: Michael R. Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, Mohammad Norouzi

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on held-out transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state of the art. (A hedged sketch of such an autoregressive model follows this table.)
Researcher Affiliation | Collaboration | Michael R. Zhang¹, Tom Le Paine², Ofir Nachum³, Cosmin Paduraru², George Tucker³, Ziyu Wang³, Mohammad Norouzi³ (¹University of Toronto, ²DeepMind, ³Google Brain)
Pseudocode | Yes | Algorithm 1: Model-based OPE (a hedged sketch of this procedure follows this table)
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, nor does it provide a link to a code repository for the described methodology.
Open Datasets | Yes | We use the offline datasets from RL Unplugged (Gulcehre et al., 2020), the details of which are provided in Table 1.
Dataset Splits | Yes | We allocate 80% of the data for training and 20% of the data for model selection.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud computing instance specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using 'Adam' as an optimizer and 'MuJoCo' as a physics engine, but it does not specify version numbers for these or any other software components (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | We perform a thorough hyperparameter sweep in the experiments and use standard practice from generative modeling to improve the quality of the models. We allocate 80% of the data for training and 20% of the data for model selection. We vary the depth and width of the neural networks (number of layers {3, 4}, layer size {512, 1024}), add different amounts of noise to input states and actions, and consider two levels of weight decay for regularization (input noise {0, 1e-6, 1e-7}, weight decay {0, 1e-6}). For the choice of optimizer, we consider both Adam (Kingma and Ba, 2014) and SGD with momentum and find Adam to be more effective at maximizing log-likelihood across all tasks in preliminary experiments. We thus use Adam in all of our experiments with two learning rates {1e-3, 3e-4}. We decay the optimizer's learning rate linearly to zero throughout training, finding this choice to outperform a constant learning rate. Lastly, we find that longer training often improves log-likelihood results. We use 500 epochs for training final models. (A sketch of this sweep grid appears after the table.)
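For reference, a minimal sketch of the kind of autoregressive dynamics model summarized in the Research Type row, written in PyTorch. The class and parameter names (AutoregressiveDynamicsModel, hidden_size) are illustrative assumptions rather than the authors' implementation, and each per-dimension conditional is modeled as a Gaussian here purely for brevity. The key idea is that the next state factorizes across dimensions, p(s' | s, a) = prod_i p(s'_i | s, a, s'_{<i}), so log-likelihood is summed over per-dimension factors and sampling proceeds one dimension at a time.

```python
import torch
import torch.nn as nn


class AutoregressiveDynamicsModel(nn.Module):
    """Sketch: p(s' | s, a) = prod_i p(s'_i | s, a, s'_{<i}), each factor a Gaussian."""

    def __init__(self, state_dim, action_dim, hidden_size=512):
        super().__init__()
        self.state_dim = state_dim
        # One small MLP per state dimension; the head for dimension i also sees s'_{<i}.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim + i, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, 2),  # mean and log-std of s'_i
            )
            for i in range(state_dim)
        ])

    def log_prob(self, state, action, next_state):
        """Log-likelihood of an observed transition, summed over state dimensions."""
        total = 0.0
        for i, head in enumerate(self.heads):
            context = torch.cat([state, action, next_state[:, :i]], dim=-1)
            mean, log_std = head(context).chunk(2, dim=-1)
            dist = torch.distributions.Normal(mean, log_std.exp())
            total = total + dist.log_prob(next_state[:, i : i + 1]).sum(-1)
        return total

    @torch.no_grad()
    def sample(self, state, action):
        """Sample s' one dimension at a time, feeding earlier samples back in."""
        dims = []
        for head in self.heads:
            context = torch.cat([state, action] + dims, dim=-1)
            mean, log_std = head(context).chunk(2, dim=-1)
            dims.append(torch.distributions.Normal(mean, log_std.exp()).sample())
        return torch.cat(dims, dim=-1)
```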
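The Pseudocode row refers to Algorithm 1 (model-based OPE). A hedged sketch of that procedure under common assumptions: roll out the target policy inside the learned model from dataset start states and average the discounted returns. The interface names (policy.act, model.sample, reward_fn) and the horizon and discount defaults are placeholders, not the paper's exact configuration.

```python
import torch


@torch.no_grad()
def model_based_ope(model, policy, reward_fn, start_states, horizon=1000, gamma=0.99):
    """Estimate the value of `policy` via Monte Carlo rollouts in the learned model.

    `model.sample(s, a)` returns sampled next states, `policy.act(s)` returns actions,
    and `reward_fn(s, a, s_next)` is either learned alongside the dynamics or known.
    """
    state = start_states                      # (num_rollouts, state_dim)
    returns = torch.zeros(state.shape[0])
    discount = 1.0
    for _ in range(horizon):
        action = policy.act(state)
        next_state = model.sample(state, action)
        returns += discount * reward_fn(state, action, next_state)
        discount *= gamma
        state = next_state
    return returns.mean()                     # scalar estimate of the policy's value
```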
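The Experiment Setup row lists a grid sweep over architecture, regularization, and learning rate. A minimal sketch of how that grid could be enumerated; the dictionary keys are descriptive labels, not the authors' configuration names.

```python
from itertools import product

# Grid reported in the Experiment Setup row.
sweep = {
    "num_layers": [3, 4],
    "layer_size": [512, 1024],
    "input_noise": [0.0, 1e-6, 1e-7],
    "weight_decay": [0.0, 1e-6],
    "learning_rate": [1e-3, 3e-4],  # Adam, learning rate decayed linearly to zero
}

configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs))  # 2 * 2 * 3 * 2 * 2 = 48 configurations, each trained for 500 epochs
```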