Bootstrapping Fitted Q-Evaluation for Off-Policy Inference

Authors: Botao Hao, Xiang Ji, Yaqi Duan, Hao Lu, Csaba Szepesvári, Mengdi Wang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.
Researcher Affiliation | Collaboration | DeepMind; Princeton University; University of Alberta.
Pseudocode | Yes | Algorithm 1: Subsampled Bootstrapping FQE (a sketch of the procedure follows this table).
Open Source Code | No | The paper does not state that the source code for the described method is released, and it gives no link to a code repository.
Open Datasets | Yes | We first consider the Cliff Walking environment (Sutton & Barto, 2018)... Next we test the methods on the classical Mountain Car environment (Moore, 1990)... We use the sepsis management simulator by Oberst & Sontag (2019) for our study. (Environment handles are shown below.)
Dataset Splits | No | The paper does not provide the split information (exact percentages, sample counts, citations to predefined splits, or a splitting methodology) needed to reproduce the training, validation, and test partitions.
Hardware Specification | No | The paper does not report the hardware (GPU/CPU models, processor speeds, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not list the ancillary software (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | Yes | For constructing confidence intervals, we fix the confidence level at δ = 0.1. We apply the bootstrapping FQE using a neural network function approximator with three fully connected layers, where the first layer uses 256 units and a ReLU activation function, the second layer uses 32 units and a SELU activation function, and the last layer uses Softsign... Let the behavior policy be the 0.15 ϵ-greedy policy. (A sketch of this network appears below.)
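
The Pseudocode row refers to Algorithm 1, Subsampled Bootstrapping FQE. The following is a minimal sketch of the subsampled-bootstrap loop, not the paper's implementation: run_fqe is a placeholder for any FQE routine that maps a dataset of episodes to a scalar value estimate, and the subsample size m, the number of bootstrap rounds B, and resampling with replacement are all assumptions.

    # Hypothetical sketch of subsampled bootstrapping FQE (Algorithm 1).
    # `run_fqe`, `num_bootstrap`, and `subsample_size` are assumptions,
    # not values taken from the paper.
    import numpy as np

    def bootstrap_fqe(episodes, run_fqe, num_bootstrap=200,
                      subsample_size=None, delta=0.1, rng=None):
        """Return the FQE point estimate, a (1 - delta) percentile CI,
        and the bootstrap variance estimate."""
        rng = np.random.default_rng() if rng is None else rng
        n = len(episodes)
        m = n if subsample_size is None else subsample_size

        point_estimate = run_fqe(episodes)       # FQE on the full dataset
        boot_estimates = np.empty(num_bootstrap)
        for b in range(num_bootstrap):
            # Resample m episodes with replacement (subsampled bootstrap).
            idx = rng.integers(0, n, size=m)
            boot_estimates[b] = run_fqe([episodes[i] for i in idx])

        # Percentile confidence interval at confidence level delta.
        lo, hi = np.quantile(boot_estimates, [delta / 2, 1 - delta / 2])
        variance = boot_estimates.var(ddof=1)    # bootstrap variance
        return point_estimate, (lo, hi), variance

Running two off-policy evaluators on the same resampled indices and taking np.corrcoef of the two vectors of bootstrap estimates yields the correlation estimate mentioned in the Research Type row.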
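
The two classic control tasks cited in the Open Datasets row are available under registered IDs in recent Gymnasium releases; the IDs below are an assumption about the reader's setup, and the sepsis management simulator of Oberst & Sontag (2019) is distributed separately by its authors and is not shown here.

    # Hypothetical setup using Gymnasium's registered versions of the
    # two classic control tasks named in the paper.
    import gymnasium as gym

    cliff_walking = gym.make("CliffWalking-v0")  # Sutton & Barto (2018)
    mountain_car = gym.make("MountainCar-v0")    # Moore (1990)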
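
The Experiment Setup row specifies the function approximator precisely enough to reconstruct it. Below is a minimal PyTorch sketch under that description; the input dimension, output scaling, optimizer, and training loop are not given in the excerpt and are left as assumptions.

    # Sketch of the three-layer FQE network from the Experiment Setup row.
    # Layer widths and activations follow the paper's description;
    # `state_action_dim` and everything around the network are assumptions.
    import torch.nn as nn

    def make_fqe_network(state_action_dim: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Linear(state_action_dim, 256), nn.ReLU(),  # layer 1: 256 units, ReLU
            nn.Linear(256, 32), nn.SELU(),                # layer 2: 32 units, SELU
            nn.Linear(32, 1), nn.Softsign(),              # output: Softsign, range (-1, 1)
        )

Because Softsign bounds the output to (-1, 1), value targets would need to be rescaled into that range; the excerpt does not say how, so that step is omitted. Data collection under the ϵ-greedy behavior policy with ϵ = 0.15 is likewise assumed to happen outside this snippet.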