Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

Authors: Botao Hao, Xiang Ji, Yaqi Duan, Hao Lu, Csaba Szepesvári, Mengdi Wang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.
Researcher Affiliation | Collaboration | DeepMind; Princeton University; University of Alberta.
Pseudocode | Yes | Algorithm 1: Subsampled Bootstrapping FQE (a sketch of the procedure follows this table).
Open Source Code | No | The paper does not state that the source code for the described method is released, and it gives no link to a code repository.
Open Datasets | Yes | We first consider the Cliff Walking environment (Sutton & Barto, 2018)... Next we test the methods on the classical Mountain Car environment (Moore, 1990)... We use the sepsis management simulator by Oberst & Sontag (2019) for our study. (Environment handles are shown below.)
Dataset Splits | No | The paper does not provide the split information (exact percentages, sample counts, citations to predefined splits, or a splitting methodology) needed to reproduce the training, validation, and test partitions.
Hardware Specification | No | The paper does not report the hardware (GPU/CPU models, processor speeds, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not list the ancillary software (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | Yes | For constructing confidence intervals, we fix the confidence level at δ = 0.1. We apply the bootstrapping FQE using a neural network function approximator with three fully connected layers, where the first layer uses 256 units and a ReLU activation function, the second layer uses 32 units and a SELU activation function, and the last layer uses Softsign... Let the behavior policy be the 0.15 ϵ-greedy policy. (A sketch of this network appears below.)
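
The Pseudocode row refers to Algorithm 1, Subsampled Bootstrapping FQE. The following is a minimal sketch of the subsampled-bootstrap loop, not the paper's implementation: run_fqe is a placeholder for any FQE routine that maps a dataset of episodes to a scalar value estimate, and the subsample size m, the number of bootstrap rounds B, and resampling with replacement are all assumptions.

    # Hypothetical sketch of subsampled bootstrapping FQE (Algorithm 1).
    # `run_fqe`, `num_bootstrap`, and `subsample_size` are assumptions,
    # not values taken from the paper.
    import numpy as np

    def bootstrap_fqe(episodes, run_fqe, num_bootstrap=200,
                      subsample_size=None, delta=0.1, rng=None):
        """Return the FQE point estimate, a (1 - delta) percentile CI,
        and the bootstrap variance estimate."""
        rng = np.random.default_rng() if rng is None else rng
        n = len(episodes)
        m = n if subsample_size is None else subsample_size

        point_estimate = run_fqe(episodes)       # FQE on the full dataset
        boot_estimates = np.empty(num_bootstrap)
        for b in range(num_bootstrap):
            # Resample m episodes with replacement (subsampled bootstrap).
            idx = rng.integers(0, n, size=m)
            boot_estimates[b] = run_fqe([episodes[i] for i in idx])

        # Percentile confidence interval at confidence level delta.
        lo, hi = np.quantile(boot_estimates, [delta / 2, 1 - delta / 2])
        variance = boot_estimates.var(ddof=1)    # bootstrap variance
        return point_estimate, (lo, hi), variance

Running two off-policy evaluators on the same resampled indices and taking np.corrcoef of the two vectors of bootstrap estimates yields the correlation estimate mentioned in the Research Type row.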
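
The two classic control tasks cited in the Open Datasets row are available under registered IDs in recent Gymnasium releases; the IDs below are an assumption about the reader's setup, and the sepsis management simulator of Oberst & Sontag (2019) is distributed separately by its authors and is not shown here.

    # Hypothetical setup using Gymnasium's registered versions of the
    # two classic control tasks named in the paper.
    import gymnasium as gym

    cliff_walking = gym.make("CliffWalking-v0")  # Sutton & Barto (2018)
    mountain_car = gym.make("MountainCar-v0")    # Moore (1990)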
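
The Experiment Setup row specifies the function approximator precisely enough to reconstruct it. Below is a minimal PyTorch sketch under that description; the input dimension, output scaling, optimizer, and training loop are not given in the excerpt and are left as assumptions.

    # Sketch of the three-layer FQE network from the Experiment Setup row.
    # Layer widths and activations follow the paper's description;
    # `state_action_dim` and everything around the network are assumptions.
    import torch.nn as nn

    def make_fqe_network(state_action_dim: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(state_action_dim, 256), nn.ReLU(),  # layer 1: 256 units, ReLU
            nn.Linear(256, 32), nn.SELU(),                # layer 2: 32 units, SELU
            nn.Linear(32, 1), nn.Softsign(),              # output: Softsign, range (-1, 1)
        )

Because Softsign bounds the output to (-1, 1), value targets would need to be rescaled into that range; the excerpt does not say how, so that step is omitted. Data collection under the ϵ-greedy behavior policy with ϵ = 0.15 is likewise assumed to happen outside this snippet.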