Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

Authors: Nan Jiang, Jiawei Huang

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide preliminary empirical results to support the theoretical predictions, that (1) which bound is the upper bound depends on the expressivity of function classes (Sec. 4.3), and (2) our interval is tighter than the naïve intervals based on previous methods (App. E). We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0). We use neural nets for Q and W, and optimize the losses using stochastic gradient descent ascent (SGDA); see App. G for more details. Fig. 1 demonstrates the interval reversal phenomenon, where we compute LB_q (UB_w) and UB_q (LB_w) for Q-networks of different sizes while fixing everything else. (A hedged sketch of such a softmax target policy appears after the table.)
Researcher Affiliation | Academia | Nan Jiang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, nanjiang@illinois.edu; Jiawei Huang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, jiaweih@illinois.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any statements about releasing code or links to source code repositories for the described methodology.
Open Datasets | Yes | We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0).
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper mentions using "neural nets for Q and W" and "stochastic gradient descent ascent (SGDA)" but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0). We use neural nets for Q and W, and optimize the losses using stochastic gradient descent ascent (SGDA); see App. G for more details. (A hedged SGDA sketch appears after the table.)
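
To make the quoted setup concrete, below is a minimal sketch of a softmax policy derived from a pre-trained Q-function with temperature τ. The function name and the example Q-values are hypothetical; this is an illustration of the described construction, not the authors' code.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """Action probabilities from a pre-trained Q-function, softmax with temperature tau.
    Smaller tau concentrates mass on the greedy action; tau = 1.0 corresponds to the
    behavior-policy temperature in the quoted setup. (Hypothetical helper, for illustration.)"""
    logits = np.asarray(q_values, dtype=np.float64) / tau
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: two Cart Pole actions with Q(s, .) = [1.2, 0.7]
print(softmax_policy([1.2, 0.7], tau=0.5))   # more peaked target policy
print(softmax_policy([1.2, 0.7], tau=1.0))   # behavior-policy temperature
```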
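The losses are described as being optimized with stochastic gradient descent ascent (SGDA) over a Q-network and a W-network. The sketch below shows one simultaneous descent (Q) / ascent (W) step on a W-weighted Bellman-residual objective; the network sizes, learning rates, and the exact loss are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Minimal SGDA sketch for a minimax loss over two small networks (assumed architecture).
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # Q(s, a) for 2 Cart Pole actions
w_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # W(s) weight function

opt_q = torch.optim.SGD(q_net.parameters(), lr=1e-3)  # Q is the minimizing player
opt_w = torch.optim.SGD(w_net.parameters(), lr=1e-3)  # W is the maximizing player
gamma = 0.99

def sgda_step(s, a, r, s_next, pi_next):
    """One simultaneous descent (Q) / ascent (W) update on a batch of transitions."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    v_next = (q_net(s_next) * pi_next).sum(dim=1)             # E_{a'~pi}[Q(s', a')]
    bellman_residual = r + gamma * v_next - q_sa
    loss = (w_net(s).squeeze(1) * bellman_residual).mean()    # W-weighted Bellman residual (illustrative)
    opt_q.zero_grad(); opt_w.zero_grad()
    loss.backward()
    opt_q.step()                                               # Q descends the loss
    for p in w_net.parameters():                               # W ascends: negate its gradients
        if p.grad is not None:
            p.grad.neg_()
    opt_w.step()
    return loss.item()
```

A call with a batch of sampled transitions (s, a, r, s', π(·|s')) performs one coupled update; alternating several ascent steps per descent step is a common variant of the same scheme.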