Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization
Authors: Michael R. Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, Mohammad Norouzi
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on heldout transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state-of-the-art. (A minimal sketch of such a model follows the table.) |
| Researcher Affiliation | Collaboration | Michael R. Zhang¹, Tom Le Paine², Ofir Nachum³, Cosmin Paduraru², George Tucker³, Ziyu Wang³, Mohammad Norouzi³ (¹University of Toronto, ²DeepMind, ³Google Brain) |
| Pseudocode | Yes | Algorithm 1: Model-based OPE (a minimal sketch of this procedure follows the table) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its source code, nor does it provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | We use the offline datasets from RL Unplugged (Gulcehre et al., 2020), the details of which are provided in Table 1. |
| Dataset Splits | Yes | We allocate 80% of the data for training and 20% of the data for model selection. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud computing instance specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Adam' as an optimizer and 'MuJoCo' as a physics engine, but it does not specify version numbers for these or any other software components (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | We perform a thorough hyperparameter sweep in the experiments and use standard practice from generative modeling to improve the quality of the models. We allocate 80% of the data for training and 20% of the data for model selection. We vary the depth and width of the neural networks (number of layers {3, 4}, layer size {512, 1024}), add different amounts of noise to input states and actions, and consider two levels of weight decay for regularization (input noise {0, 1e-6, 1e-7}, weight decay {0, 1e-6}). For the choice of optimizer, we consider both Adam (Kingma and Ba, 2014) and SGD with momentum and find Adam to be more effective at maximizing log-likelihood across all tasks in preliminary experiments. We thus use Adam in all of our experiments with two learning rates {1e-3, 3e-4}. We decay the optimizer's learning rate linearly to zero throughout training, finding this choice to outperform a constant learning rate. Lastly, we find that longer training often improves log-likelihood results. We use 500 epochs for training final models. (The data split and sweep grid are sketched below.) |
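To make the "Research Type" row concrete, here is a minimal sketch of an autoregressive dynamics model: each next-state dimension is sampled in turn, conditioned on the current state, the action, and the dimensions generated so far. The shared MLP with a dimension-index input is an illustrative assumption, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class AutoregressiveDynamicsModel(nn.Module):
    """Hypothetical sketch: one shared MLP predicts a Gaussian over each
    next-state dimension, conditioned on (state, action, previously
    sampled dimensions, dimension index)."""

    def __init__(self, state_dim, action_dim, hidden=512):
        super().__init__()
        self.state_dim = state_dim
        # Inputs: s, a, zero-padded s'_{<i}, and the dimension index i.
        in_dim = state_dim + action_dim + state_dim + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # mean and log-std for dimension i
        )

    @torch.no_grad()
    def sample_next_state(self, state, action):
        batch = state.shape[0]
        prev = torch.zeros(batch, self.state_dim)  # s'_{<i}, zero-padded
        for i in range(self.state_dim):
            idx = torch.full((batch, 1), float(i))
            out = self.net(torch.cat([state, action, prev, idx], dim=-1))
            mean, log_std = out[:, :1], out[:, 1:]
            prev[:, i : i + 1] = mean + log_std.exp() * torch.randn_like(mean)
        return prev
```

For HalfCheetah-sized inputs this would be instantiated as `AutoregressiveDynamicsModel(state_dim=17, action_dim=6)`; the point of the sequential sampling loop is that later dimensions can depend on earlier ones, which a standard feedforward model cannot express.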
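The "Pseudocode" row refers to Algorithm 1 (model-based OPE): roll the evaluated policy out inside the learned model and average discounted returns. A minimal sketch under assumed interfaces follows; `policy(s)` returning an action and `model.step(s, a)` returning a sampled next state and reward are hypothetical names, not from the paper.

```python
import numpy as np

def model_based_ope(policy, model, init_states, horizon=1000,
                    gamma=0.99, n_rollouts=10):
    """Monte Carlo estimate of a policy's value from model rollouts.
    `model.step(s, a)` is an assumed interface returning (s', r)."""
    returns = []
    for s0 in init_states:
        for _ in range(n_rollouts):
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(horizon):
                a = policy(s)            # action from the evaluated policy
                s, r = model.step(s, a)  # sampled next state and reward
                ret += discount * r
                discount *= gamma
            returns.append(ret)
    return float(np.mean(returns))  # OPE estimate of the policy's value
```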
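Finally, the "Dataset Splits" and "Experiment Setup" rows together pin down an 80/20 split and a concrete sweep grid. A sketch, assuming transitions are stored as an indexable sequence; the helper names are illustrative.

```python
import itertools
import numpy as np

def split_transitions(transitions, train_frac=0.8, seed=0):
    """80/20 split of offline transitions into training and
    model-selection (validation) sets, as described in the paper."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    cut = int(train_frac * len(transitions))
    return ([transitions[i] for i in idx[:cut]],
            [transitions[i] for i in idx[cut:]])

# Hyperparameter grid from the paper's sweep (48 configurations);
# models are selected by log-likelihood on the held-out 20%.
GRID = list(itertools.product(
    [3, 4],             # number of layers
    [512, 1024],        # layer size
    [0.0, 1e-6, 1e-7],  # input noise on states and actions
    [0.0, 1e-6],        # weight decay
    [1e-3, 3e-4],       # Adam learning rate (decayed linearly to zero)
))
```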