Minimax Value Interval for Off-Policy Evaluation and Policy Optimization

Authors: Nan Jiang, Jiawei Huang

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide preliminary empirical results to support the theoretical predictions, that (1) which bound is the upper bound depends on the expressivity of function classes (Sec. 4.3), and (2) our interval is tighter than the naïve intervals based on previous methods (App. E). We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0). We use neural nets for Q and W, and optimize the losses using stochastic gradient descent ascent (SGDA); see App. G for more details. Fig. 1 demonstrates the interval reversal phenomenon, where we compute LB_q (UB_w) and UB_q (LB_w) for Q-networks of different sizes while fixing everything else. (A hedged sketch of such a softmax target policy appears after the table.)
Researcher Affiliation | Academia | Nan Jiang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, nanjiang@illinois.edu; Jiawei Huang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, jiaweih@illinois.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any statements about releasing code or links to source code repositories for the described methodology.
Open Datasets | Yes | We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0).
Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper mentions using "neural nets for Q and W" and "stochastic gradient descent ascent (SGDA)" but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0). We use neural nets for Q and W, and optimize the losses using stochastic gradient descent ascent (SGDA); see App. G for more details. (A hedged SGDA sketch appears after the table.)
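
To make the quoted setup concrete, below is a minimal sketch of a softmax policy derived from a pre-trained Q-function with temperature τ. The function name and the example Q-values are hypothetical; this is an illustration of the described construction, not the authors' code.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """Action probabilities from a pre-trained Q-function, softmax with temperature tau.
    Smaller tau concentrates mass on the greedy action; tau = 1.0 corresponds to the
    behavior-policy temperature in the quoted setup. (Hypothetical helper, for illustration.)"""
    logits = np.asarray(q_values, dtype=np.float64) / tau
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example: two Cart Pole actions with Q(s, .) = [1.2, 0.7]
print(softmax_policy([1.2, 0.7], tau=0.5))   # more peaked target policy
print(softmax_policy([1.2, 0.7], tau=1.0))   # behavior-policy temperature
```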
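The losses are described as being optimized with stochastic gradient descent ascent (SGDA) over a Q-network and a W-network. The sketch below shows one simultaneous descent (Q) / ascent (W) step on a W-weighted Bellman-residual objective; the network sizes, learning rates, and the exact loss are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Minimal SGDA sketch for a minimax loss over two small networks (assumed architecture).
q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # Q(s, a) for 2 Cart Pole actions
w_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # W(s) weight function

opt_q = torch.optim.SGD(q_net.parameters(), lr=1e-3)  # Q is the minimizing player
opt_w = torch.optim.SGD(w_net.parameters(), lr=1e-3)  # W is the maximizing player
gamma = 0.99

def sgda_step(s, a, r, s_next, pi_next):
    """One simultaneous descent (Q) / ascent (W) update on a batch of transitions."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    v_next = (q_net(s_next) * pi_next).sum(dim=1)             # E_{a'~pi}[Q(s', a')]
    bellman_residual = r + gamma * v_next - q_sa
    loss = (w_net(s).squeeze(1) * bellman_residual).mean()    # W-weighted Bellman residual (illustrative)
    opt_q.zero_grad(); opt_w.zero_grad()
    loss.backward()
    opt_q.step()                                               # Q descends the loss
    for p in w_net.parameters():                               # W ascends: negate its gradients
        if p.grad is not None:
            p.grad.neg_()
    opt_w.step()
    return loss.item()
```

A call with a batch of sampled transitions (s, a, r, s', π(·|s')) performs one coupled update; alternating several ascent steps per descent step is a common variant of the same scheme.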