Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
Authors: Nan Jiang, Jiawei Huang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide preliminary empirical results to support the theoretical predictions, that (1) which bound is the upper bound depends on the expressivity of function classes (Sec. 4.3), and (2) our interval is tighter than the naïve intervals based on previous methods (App. E). We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0). We use neural nets for Q and W, and optimize the losses using stochastic gradient descent ascent (SGDA); see App. G for more details. Fig. 1 demonstrates the interval reversal phenomenon, where we compute LB_q(UB_w) and UB_q(LB_w) for Q-networks of different sizes while fixing everything else. |
| Researcher Affiliation | Academia | Nan Jiang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, nanjiang@illinois.edu; Jiawei Huang, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, jiaweih@illinois.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statements about releasing code or links to source code repositories for the described methodology. |
| Open Datasets | Yes | We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0). |
| Dataset Splits | No | The paper does not provide specific details on training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions using “neural nets for Q and W” and “stochastic gradient descent ascent (SGDA)” but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We conduct the experiments in Cart Pole, with the target policy being softmax over a pre-trained Q-function with temperature τ (behavior policy is τ = 1.0). We use neural nets for Q and W, and optimize the losses using stochastic gradient descent ascent (SGDA); see App. G for more details. (An illustrative sketch of this setup appears after the table.) |
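The experiment setup quoted above (a softmax target policy over a pre-trained Q-function with temperature τ, neural nets for Q and W, and SGDA over a minimax objective) can be illustrated with the PyTorch sketch below. This is not the authors' code: the Lagrangian form, network sizes, learning rates, and the `softmax_policy`, `lagrangian`, and `sgda_step` helpers are assumptions made for illustration; the exact objective and hyperparameters are described in the paper and its App. G.

```python
# Hedged sketch (not the authors' implementation): one SGDA step on a
# minimax value-bound objective, plus the softmax target-policy construction
# described in the table. All sizes and constants are illustrative.
import torch
import torch.nn as nn

GAMMA, TAU = 0.99, 0.5          # discount and softmax temperature (assumed values)
S_DIM, A_DIM = 4, 2             # CartPole-like state/action dimensions

def softmax_policy(q_pretrained, state, tau=TAU):
    """Target policy: softmax over a pre-trained Q-function with temperature tau."""
    with torch.no_grad():
        logits = q_pretrained(state) / tau      # shape (batch, A_DIM)
    return torch.softmax(logits, dim=-1)

# Small MLPs for q(s, .) and the importance-weight function w(s, a) (assumed sizes).
q_net = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(), nn.Linear(64, A_DIM))
w_net = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
q_pretrained = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(), nn.Linear(64, A_DIM))

opt_q = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # descent player
opt_w = torch.optim.Adam(w_net.parameters(), lr=1e-3)   # ascent player

def lagrangian(batch):
    """One common Lagrangian form, assumed here for illustration:
    L(q, w) = (1-gamma) E_d0[q(s0, pi)] + E_data[w(s,a)(r + gamma q(s', pi) - q(s,a))]."""
    s, a_onehot, r, s_next, s0 = batch
    pi_next = softmax_policy(q_pretrained, s_next)             # pi(.|s')
    q_next = (q_net(s_next) * pi_next).sum(-1)                 # E_{a'~pi} q(s', a')
    q_sa = (q_net(s) * a_onehot).sum(-1)                       # q(s, a)
    w_sa = w_net(torch.cat([s, a_onehot], dim=-1)).squeeze(-1)
    pi0 = softmax_policy(q_pretrained, s0)
    j0 = (1 - GAMMA) * (q_net(s0) * pi0).sum(-1).mean()
    return j0 + (w_sa * (r + GAMMA * q_next - q_sa)).mean()

def sgda_step(batch):
    """Stochastic gradient descent (over q) ascent (over w) on the Lagrangian."""
    loss_q = lagrangian(batch)
    opt_q.zero_grad(); loss_q.backward(); opt_q.step()         # q minimizes
    loss_w = -lagrangian(batch)
    opt_w.zero_grad(); loss_w.backward(); opt_w.step()         # w maximizes
```

The interval endpoints discussed in the Research Type row (e.g., LB_q(UB_w) and UB_q(LB_w) across Q-networks of different sizes) would then be obtained by running such a loop with the min/max roles of q and w swapped as appropriate; this, too, is an assumption about how the quoted setup is realized rather than a statement of the paper's exact procedure.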