Off-Policy Evaluation via the Regularized Lagrangian
Authors: Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically verify the theoretical findings. We evaluate different choices of estimators, regularizers, and constraints on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher), under linear and neural network parametrizations, with offline data collected from behavior policies with different noise levels (π1 and π2). |
| Researcher Affiliation | Collaboration | Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li (Google Research); Dale Schuurmans (Google Research, University of Alberta) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a direct link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate different choices of estimators, regularizers, and constraints, on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher)... [3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016. |
| Dataset Splits | No | The paper mentions using a "fixed dataset D" and "offline data collected from behavior policies", but does not provide specific details on how this data was split into training, validation, or test sets for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using JAX for implementation, but does not provide specific version numbers for JAX or any other software dependencies needed to replicate the experiment. |
| Experiment Setup | Yes | All of our experiments are run for 200k train steps, with 10k gradient updates. Learning rates for primal and dual variables are tuned for each experiment, and are 1e-4 for all Cartpole experiments, 3e-4 for Grid and Reacher when using linear parametrization and 1e-4 for Grid and Reacher with neural network parametrization. We use Adam optimizer. For neural network parametrization, we use a two-layer neural network with 256 units for Q-function and 512 units for ζ-function. We use ReLU activations. We perform 100 updates of the dual variables for every 1 update of the primal variable to facilitate optimization. |
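To make the reported setup concrete, below is a minimal JAX sketch of the training configuration described in the Experiment Setup row. JAX is the only framework the paper names; the use of optax, the `lagrangian` placeholder objective, the feature dimension, and all function and variable names here are our assumptions, not the authors' code. The sketch wires up the two reported networks (a two-layer ReLU net with 256 hidden units for the Q-function and 512 for the ζ-function), Adam at the reported 1e-4 learning rate, and the reported 100:1 ratio of dual (ζ) to primal (Q) updates.

```python
# Minimal sketch only: `lagrangian`, the feature dimension, and the
# primal/dual role assignment are illustrative assumptions, not the
# authors' implementation.
import jax
import jax.numpy as jnp
import optax


def init_mlp(key, in_dim, hidden, out_dim=1):
    """Parameters for a two-layer MLP: in_dim -> hidden -> out_dim."""
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) * 0.01,
        "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, out_dim)) * 0.01,
        "b2": jnp.zeros(out_dim),
    }


def mlp(params, x):
    """Two-layer MLP with ReLU activations, as reported in the paper."""
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    return (h @ params["w2"] + params["b2"]).squeeze(-1)


def lagrangian(q_params, zeta_params, batch):
    """Hypothetical stand-in for the regularized Lagrangian objective.

    The real objective couples Q and zeta through Bellman residuals on
    offline transitions; here we only exercise both networks so the
    update machinery below runs.
    """
    return jnp.mean(mlp(zeta_params, batch) * mlp(q_params, batch))


key = jax.random.PRNGKey(0)
kq, kz, kx = jax.random.split(key, 3)
feat_dim = 4  # placeholder feature dimension (e.g. Cartpole observations)

q_params = init_mlp(kq, feat_dim, 256)     # Q-function: 256 hidden units
zeta_params = init_mlp(kz, feat_dim, 512)  # zeta-function: 512 hidden units

q_opt = optax.adam(1e-4)  # reported rate for Cartpole / neural-net runs
z_opt = optax.adam(1e-4)
q_state = q_opt.init(q_params)
z_state = z_opt.init(zeta_params)


@jax.jit
def train_step(q_params, q_state, zeta_params, z_state, batch):
    """One primal update preceded by 100 dual updates (the reported ratio)."""

    def dual_update(carry, _):
        zeta_params, z_state = carry
        # Ascend the Lagrangian in zeta by descending on -L.
        grads = jax.grad(
            lambda zp: -lagrangian(q_params, zp, batch))(zeta_params)
        updates, z_state = z_opt.update(grads, z_state, zeta_params)
        return (optax.apply_updates(zeta_params, updates), z_state), None

    (zeta_params, z_state), _ = jax.lax.scan(
        dual_update, (zeta_params, z_state), None, length=100)

    # One descent step on the primal (Q) variables.
    q_grads = jax.grad(lagrangian)(q_params, zeta_params, batch)
    q_updates, q_state = q_opt.update(q_grads, q_state, q_params)
    q_params = optax.apply_updates(q_params, q_updates)
    return q_params, q_state, zeta_params, z_state


batch = jax.random.normal(kx, (32, feat_dim))  # fake transition features
q_params, q_state, zeta_params, z_state = train_step(
    q_params, q_state, zeta_params, z_state, batch)
```

The inner loop of 100 dual updates is expressed as a `jax.lax.scan` so the whole step can be JIT-compiled; whether Q or ζ plays the ascending role depends on the formulation, and the assignment above is illustrative.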