Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Authors: Michael R. Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, Mohammad Norouzi

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on held-out transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state of the art. (A hedged sketch of such an autoregressive model follows this table.)
Researcher Affiliation | Collaboration | Michael R. Zhang¹, Tom Le Paine², Ofir Nachum³, Cosmin Paduraru², George Tucker³, Ziyu Wang³, Mohammad Norouzi³ (¹University of Toronto, ²DeepMind, ³Google Brain)
Pseudocode | Yes | Algorithm 1: Model-based OPE (a hedged sketch of this procedure follows this table)
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, nor does it provide a link to a code repository for the described methodology.
Open Datasets | Yes | We use the offline datasets from RL Unplugged (Gulcehre et al., 2020), the details of which are provided in Table 1.
Dataset Splits | Yes | We allocate 80% of the data for training and 20% of the data for model selection.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud computing instance specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using 'Adam' as an optimizer and 'MuJoCo' as a physics engine, but it does not specify version numbers for these or any other software components (e.g., programming languages, libraries, frameworks).
Experiment Setup | Yes | We perform a thorough hyperparameter sweep in the experiments and use standard practice from generative modeling to improve the quality of the models. We allocate 80% of the data for training and 20% of the data for model selection. We vary the depth and width of the neural networks (number of layers {3, 4}, layer size {512, 1024}), add different amounts of noise to input states and actions, and consider two levels of weight decay for regularization (input noise {0, 1e-6, 1e-7}, weight decay {0, 1e-6}). For the choice of optimizer, we consider both Adam (Kingma and Ba, 2014) and SGD with momentum and find Adam to be more effective at maximizing log-likelihood across all tasks in preliminary experiments. We thus use Adam in all of our experiments with two learning rates {1e-3, 3e-4}. We decay the optimizer's learning rate linearly to zero throughout training, finding this choice to outperform a constant learning rate. Lastly, we find that longer training often improves log-likelihood results. We use 500 epochs for training final models. (A sketch of this sweep grid appears after the table.)
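For reference, a minimal sketch of the kind of autoregressive dynamics model summarized in the Research Type row, written in PyTorch. The class and parameter names (AutoregressiveDynamicsModel, hidden_size) are illustrative assumptions rather than the authors' implementation, and each per-dimension conditional is modeled as a Gaussian here purely for brevity. The key idea is that the next state factorizes across dimensions, p(s' | s, a) = prod_i p(s'_i | s, a, s'_{<i}), so log-likelihood is summed over per-dimension factors and sampling proceeds one dimension at a time.

```python
import torch
import torch.nn as nn


class AutoregressiveDynamicsModel(nn.Module):
    """Sketch: p(s' | s, a) = prod_i p(s'_i | s, a, s'_{<i}), each factor a Gaussian."""

    def __init__(self, state_dim, action_dim, hidden_size=512):
        super().__init__()
        self.state_dim = state_dim
        # One small MLP per state dimension; the head for dimension i also sees s'_{<i}.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim + i, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, 2),  # mean and log-std of s'_i
            )
            for i in range(state_dim)
        ])

    def log_prob(self, state, action, next_state):
        """Log-likelihood of an observed transition, summed over state dimensions."""
        total = 0.0
        for i, head in enumerate(self.heads):
            context = torch.cat([state, action, next_state[:, :i]], dim=-1)
            mean, log_std = head(context).chunk(2, dim=-1)
            dist = torch.distributions.Normal(mean, log_std.exp())
            total = total + dist.log_prob(next_state[:, i : i + 1]).sum(-1)
        return total

    @torch.no_grad()
    def sample(self, state, action):
        """Sample s' one dimension at a time, feeding earlier samples back in."""
        dims = []
        for head in self.heads:
            context = torch.cat([state, action] + dims, dim=-1)
            mean, log_std = head(context).chunk(2, dim=-1)
            dims.append(torch.distributions.Normal(mean, log_std.exp()).sample())
        return torch.cat(dims, dim=-1)
```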
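The Pseudocode row refers to Algorithm 1 (model-based OPE). A hedged sketch of that procedure under common assumptions: roll out the target policy inside the learned model from dataset start states and average the discounted returns. The interface names (policy.act, model.sample, reward_fn) and the horizon and discount defaults are placeholders, not the paper's exact configuration.

```python
import torch


@torch.no_grad()
def model_based_ope(model, policy, reward_fn, start_states, horizon=1000, gamma=0.99):
    """Estimate the value of `policy` via Monte Carlo rollouts in the learned model.

    `model.sample(s, a)` returns sampled next states, `policy.act(s)` returns actions,
    and `reward_fn(s, a, s_next)` is either learned alongside the dynamics or known.
    """
    state = start_states                      # (num_rollouts, state_dim)
    returns = torch.zeros(state.shape[0])
    discount = 1.0
    for _ in range(horizon):
        action = policy.act(state)
        next_state = model.sample(state, action)
        returns += discount * reward_fn(state, action, next_state)
        discount *= gamma
        state = next_state
    return returns.mean()                     # scalar estimate of the policy's value
```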
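The Experiment Setup row lists a grid sweep over architecture, regularization, and learning rate. A minimal sketch of how that grid could be enumerated; the dictionary keys are descriptive labels, not the authors' configuration names.

```python
from itertools import product

# Grid reported in the Experiment Setup row.
sweep = {
    "num_layers": [3, 4],
    "layer_size": [512, 1024],
    "input_noise": [0.0, 1e-6, 1e-7],
    "weight_decay": [0.0, 1e-6],
    "learning_rate": [1e-3, 3e-4],  # Adam, learning rate decayed linearly to zero
}

configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs))  # 2 * 2 * 3 * 2 * 2 = 48 configurations, each trained for 500 epochs
```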