Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Off-Policy Evaluation and Learning for External Validity under a Covariate Shift

Authors: Masatoshi Uehara, Masahiro Kato, Shota Yasui

NeurIPS 2020 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Finally, we conduct experiments to confirm the effectiveness of the proposed estimators. In this section, we demonstrate the effectiveness of the proposed estimators using data obtained with bandit feedback. Following previous work (Dudík et al., 2011; Farajtabar et al., 2018), we evaluate the proposed estimators using the standard classification datasets from the UCI repository by transforming the classification data into contextual bandit data. From the UCI repository, we use the satimage, vehicle, and pendigits datasets.
Researcher Affiliation	Collaboration	Masatoshi Uehara (1), Masahiro Kato (2), Shota Yasui (2). (1) Cornell University, EMAIL; (2) CyberAgent Inc., EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Doubly Robust Estimator under a Covariate Shift
Open Source Code	No	The paper does not contain any statements about making its source code publicly available, nor does it provide a link to a code repository.
Open Datasets	Yes	From the UCI repository, we use the satimage, vehicle, and pendigits datasets. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
Dataset Splits	Yes	By adjusting Cprob, we classify 70% samples as the historical data and 30% samples as the evaluation data. For this estimator, we use 2-fold cross-fitting.
Hardware Specification	No	The paper does not provide any specific hardware details such as GPU or CPU models used for the experiments.
Software Dependencies	No	The paper mentions statistical methods and tools like 'kernel Ridge regression', 'KuLSIF', and 'Nadaraya-Watson regression' but does not specify any software names with version numbers for implementation.
Experiment Setup	No	For DRCS, we use 2-fold cross-fitting and add a regularization term. More details, such as the description of the data and choice of hyperparameters, are in Appendix H. The main text does not contain specific hyperparameter values.
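
The Research Type row mentions transforming classification data into contextual bandit data. As a rough illustration only, the following is a minimal sketch of that general idea in the style of Dudík et al. (2011): a behavior policy picks one action per sample, and only the reward for that action (1 if it matches the true label, 0 otherwise) is logged. The function names, the `behavior_probs` and `eval_probs` callables, and the plain inverse-probability-weighted estimate are assumptions made for illustration. This is not the paper's code, and it omits the covariate shift handling (density ratio estimation, cross-fitting) that the paper's estimators rely on.

```python
import random


def to_bandit_feedback(features, labels, n_actions, behavior_probs, seed=0):
    """Turn labeled classification rows into logged bandit tuples.

    behavior_probs is a hypothetical callable (an assumption, not from the
    paper) that maps a context x to a list of n_actions probabilities.
    """
    rng = random.Random(seed)
    logged = []
    for x, y in zip(features, labels):
        probs = behavior_probs(x)
        # Sample one action from the behavior policy.
        action = rng.choices(range(n_actions), weights=probs, k=1)[0]
        # Only the chosen action's reward is revealed: 1 if it matches the label.
        reward = 1.0 if action == y else 0.0
        logged.append((x, action, reward, probs[action]))
    return logged


def ipw_value(logged, eval_probs):
    """Plain inverse-probability-weighted value estimate (no covariate shift)."""
    total = 0.0
    for x, action, reward, propensity in logged:
        total += eval_probs(x)[action] / propensity * reward
    return total / len(logged)
```

When the evaluation policy equals the behavior policy, every importance weight is 1, so `ipw_value` reduces to the mean logged reward, which gives a quick sanity check for the conversion.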