Off-Policy Evaluation via the Regularized Lagrangian

Authors: Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically verify the theoretical findings. We evaluate different choices of estimators, regularizers, and constraints, on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher), under linear and neural network parametrizations, with offline data collected from behavior policies with different noise levels (π1 and π2).
Researcher Affiliation | Collaboration | Mengjiao Yang¹, Ofir Nachum¹, Bo Dai¹, Lihong Li¹, Dale Schuurmans¹,² (¹Google Research, ²University of Alberta)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a direct link to a code repository for the methodology described.
Open Datasets | Yes | We evaluate different choices of estimators, regularizers, and constraints, on a set of OPE tasks ranging from tabular (Grid) to discrete-control (Cartpole) and continuous-control (Reacher)... [3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Dataset Splits | No | The paper mentions using a "fixed dataset D" and "offline data collected from behavior policies", but does not provide specific details on how this data was split into training, validation, or test sets for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions using JAX for implementation, but does not provide specific version numbers for JAX or any other software dependencies needed to replicate the experiment.
Experiment Setup | Yes | All of our experiments are run for 200k train steps, with 10k gradient updates. Learning rates for primal and dual variables are tuned for each experiment, and are 1e-4 for all Cartpole experiments, 3e-4 for Grid and Reacher when using linear parametrization and 1e-4 for Grid and Reacher with neural network parametrization. We use Adam optimizer. For neural network parametrization, we use a two-layer neural network with 256 units for Q-function and 512 units for ζ-function. We use ReLU activations. We perform 100 updates of the dual variables for every 1 update of the primal variable to facilitate optimization.
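
The sketch below is not the authors' code; it is a minimal JAX illustration of the Experiment Setup details quoted above (the paper states a JAX implementation): two ReLU MLPs with 256 hidden units for the Q-function and 512 for the ζ-function, and 100 dual updates per primal update. The regularized-Lagrangian objective is replaced by a hypothetical stand-in, plain SGD stands in for the Adam optimizer the paper uses, "two-layer" is interpreted as two hidden layers, and the batch data, observation dimension, and initialization scale are assumptions.

```python
import jax
import jax.numpy as jnp

LR = 1e-4  # learning rate quoted for Cartpole / neural-network experiments

def init_mlp(key, in_dim, hidden, out_dim=1):
    """Two-hidden-layer MLP parameters (assumed reading of "two-layer")."""
    k1, k2, k3 = jax.random.split(key, 3)
    scale = 0.05  # assumed initialization scale
    return {
        "w1": scale * jax.random.normal(k1, (in_dim, hidden)),
        "b1": jnp.zeros(hidden),
        "w2": scale * jax.random.normal(k2, (hidden, hidden)),
        "b2": jnp.zeros(hidden),
        "w3": scale * jax.random.normal(k3, (hidden, out_dim)),
        "b3": jnp.zeros(out_dim),
    }

def mlp(params, x):
    h = jax.nn.relu(x @ params["w1"] + params["b1"])
    h = jax.nn.relu(h @ params["w2"] + params["b2"])
    return h @ params["w3"] + params["b3"]

def lagrangian(q_params, zeta_params, batch):
    # Hypothetical stand-in for the regularized Lagrangian objective, which
    # in the paper couples Q (primal) and zeta (dual) through the Bellman
    # residual on the offline batch; only the update schedule is the point here.
    s, s_next = batch
    delta = mlp(q_params, s_next) - mlp(q_params, s)
    return jnp.mean(mlp(zeta_params, s) * delta)

@jax.jit
def primal_step(q_params, zeta_params, batch):
    # Gradient descent on the primal (Q) variables; plain SGD stands in for Adam.
    grads = jax.grad(lagrangian, argnums=0)(q_params, zeta_params, batch)
    return jax.tree_util.tree_map(lambda p, g: p - LR * g, q_params, grads)

@jax.jit
def dual_step(q_params, zeta_params, batch):
    # Gradient ascent on the dual (zeta) variables; the sign convention
    # depends on which primal/dual formulation is used.
    grads = jax.grad(lagrangian, argnums=1)(q_params, zeta_params, batch)
    return jax.tree_util.tree_map(lambda p, g: p + LR * g, zeta_params, grads)

key = jax.random.PRNGKey(0)
kq, kz, ks, ksn = jax.random.split(key, 4)
obs_dim = 4                                      # e.g. Cartpole observations
q_params = init_mlp(kq, obs_dim, hidden=256)     # Q-function: 256 units
zeta_params = init_mlp(kz, obs_dim, hidden=512)  # zeta-function: 512 units
batch = (jax.random.normal(ks, (32, obs_dim)),   # placeholder offline batch
         jax.random.normal(ksn, (32, obs_dim)))

for step in range(10):        # the paper reports 200k train steps
    for _ in range(100):      # 100 dual updates per primal update
        zeta_params = dual_step(q_params, zeta_params, batch)
    q_params = primal_step(q_params, zeta_params, batch)
```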