Off-Policy Evaluation via the Regularized Lagrangian

Authors: Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically verify the theoretical findings. We evaluate different choices of estimators, regularizers, and constraints, on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher), under linear and neural network parametrizations, with offline data collected from behavior policies with different noise levels (π1 and π2).
Researcher Affiliation | Collaboration | Mengjiao Yang¹, Ofir Nachum¹, Bo Dai¹, Lihong Li¹, Dale Schuurmans¹,² (¹Google Research, ²University of Alberta)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a direct link to a code repository for the methodology described.
Open Datasets | Yes | We evaluate different choices of estimators, regularizers, and constraints, on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher)... [3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Dataset Splits | No | The paper mentions using a "fixed dataset D" and "offline data collected from behavior policies", but does not provide specific details on how this data was split into training, validation, or test sets for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions using JAX for implementation, but does not provide specific version numbers for JAX or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | All of our experiments are run for 200k train steps, with 10k gradient updates. Learning rates for primal and dual variables are tuned for each experiment, and are 1e-4 for all Cartpole experiments, 3e-4 for Grid and Reacher when using linear parametrization and 1e-4 for Grid and Reacher with neural network parametrization. We use Adam optimizer. For neural network parametrization, we use a two-layer neural network with 256 units for Q-function and 512 units for ζ-function. We use ReLU activations. We perform 100 updates of the dual variables for every 1 update of the primal variable to facilitate optimization.
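
The sketch below is not the authors' code; it is a minimal JAX illustration of the Experiment Setup details quoted above (the paper states a JAX implementation): two ReLU MLPs with 256 hidden units for the Q-function and 512 for the ζ-function, and 100 dual updates per primal update. The regularized-Lagrangian objective is replaced by a hypothetical stand-in, plain SGD stands in for the Adam optimizer the paper uses, "two-layer" is interpreted as two hidden layers, and the batch data, observation dimension, and initialization scale are assumptions.

```python
import jax
import jax.numpy as jnp

LR = 1e-4  # learning rate quoted for Cartpole / neural-network experiments

def init_mlp(key, in_dim, hidden, out_dim=1):
    """Two-hidden-layer MLP parameters (assumed reading of "two-layer")."""
    k1, k2, k3 = jax.random.split(key, 3)
    scale = 0.05  # assumed initialization scale
    return {
        "w1": scale * jax.random.normal(k1, (in_dim, hidden)),
        "b1": jnp.zeros(hidden),
        "w2": scale * jax.random.normal(k2, (hidden, hidden)),
        "b2": jnp.zeros(hidden),
        "w3": scale * jax.random.normal(k3, (hidden, out_dim)),
        "b3": jnp.zeros(out_dim),
    }

def mlp(params, x):
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    h = jax.nn.relu(h @ params["w2"] + params["b2"])
    return h @ params["w3"] + params["b3"]

def lagrangian(q_params, zeta_params, batch):
    # Hypothetical stand-in for the regularized Lagrangian objective, which
    # in the paper couples Q (primal) and zeta (dual) through the Bellman
    # residual on the offline batch; only the update schedule is the point here.
    s, s_next = batch
    delta = mlp(q_params, s_next) - mlp(q_params, s)
    return jnp.mean(mlp(zeta_params, s) * delta)

@jax.jit
def primal_step(q_params, zeta_params, batch):
    # Gradient descent on the primal (Q) variables; plain SGD stands in for Adam.
    grads = jax.grad(lagrangian, argnums=0)(q_params, zeta_params, batch)
    return jax.tree_util.tree_map(lambda p, g: p - LR * g, q_params, grads)

@jax.jit
def dual_step(q_params, zeta_params, batch):
    # Gradient ascent on the dual (zeta) variables; the sign convention
    # depends on which primal/dual formulation is used.
    grads = jax.grad(lagrangian, argnums=1)(q_params, zeta_params, batch)
    return jax.tree_util.tree_map(lambda p, g: p + LR * g, zeta_params, grads)

key = jax.random.PRNGKey(0)
kq, kz, ks, ksn = jax.random.split(key, 4)
obs_dim = 4                                      # e.g. Cartpole observations
q_params = init_mlp(kq, obs_dim, hidden=256)     # Q-function: 256 units
zeta_params = init_mlp(kz, obs_dim, hidden=512)  # zeta-function: 512 units
batch = (jax.random.normal(ks, (32, obs_dim)),   # placeholder offline batch
         jax.random.normal(ksn, (32, obs_dim)))

for step in range(10):        # the paper reports 200k train steps
    for _ in range(100):      # 100 dual updates per primal update
        zeta_params = dual_step(q_params, zeta_params, batch)
    q_params = primal_step(q_params, zeta_params, batch)
```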