Distributionally Robust Counterfactual Risk Minimization

Authors: Louis Faury, Ugo Tanielian, Elvis Dohmatob, Elena Smirnova, Flavian Vasile

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that this approach outperforms the state-of-the-art on four benchmark datasets, validating the relevance of using other uncertainty measures in practical applications. From Section 4 (Experimental results): It is well known that experiments in the field of counterfactual reasoning are highly sensitive to differences in datasets and implementations. Consequently, to evaluate and compare the two algorithms we previously introduced to existing solutions, we rigorously follow the experimental procedure introduced in (Swaminathan and Joachims 2015a) and used in several other works such as (Swaminathan and Joachims 2015b) since then. It relies on a supervised to unsupervised dataset conversion (Agarwal et al. 2014) to build bandit feedback from multi-label classification datasets. As in (Swaminathan and Joachims 2015a), we train exponential models π_θ(y|x) ∝ exp(θ^T φ(x, y)) for the CRM problem and use the same datasets taken from the LibSVM repository. (An illustrative sketch of this conversion and policy model is given below the table.)
Researcher Affiliation | Collaboration | (1) Criteo AI Lab; (2) LTCI, Télécom ParisTech, Université Paris-Saclay; (3) LPSM, Université Paris 6
Pseudocode | Yes | Algorithm 1: aKL-CRM
Open Source Code | No | For reproducibility purposes, we used the code provided by its authors (footnote 1) for all our experiments. (Footnote 1 points to the POEM code, not to code for the authors' own method.) The paper does not state that their code is open-source or provided.
Open Datasets | Yes | We use the same datasets taken from the LibSVM repository. The four datasets considered are Scene, Yeast, RCV1-Topics and TMC2009. (A sketch of loading LibSVM-format datasets is given below the table.)
Dataset Splits | Yes | The full supervised dataset is denoted D = {(x_1, y_1), ..., (x_N, y_N)}, and is split into three parts: D_train, D_valid, D_test. For each of the four datasets we consider (Scene, Yeast, RCV1-Topics and TMC2009), the split of the training dataset is done as follows: 75% goes to D_train and 25% to D_valid. (See the split sketch below the table.)
Hardware Specification | No | No specific hardware details (such as GPU/CPU models or cloud instances) are mentioned for running the experiments.
Software Dependencies | No | The paper mentions the 'L-BFGS algorithm' but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | As in (Swaminathan and Joachims 2015a), the clipping constant M is always set to the ratio of the 90th percentile to the 10th percentile of the propensity scores observed in the logs H (see the sketch below). Other hyper-parameters are selected by cross-validation on D_valid with the unbiased counterfactual estimator (3). In the experimental results, we also report the performance of the logging policy π_0 on the test set as an indicative baseline measure, and the performance of a skyline CRF trained on the whole supervised dataset, despite its unfair advantage. Every experiment is run 20 times with a different random seed (which controls the random training fraction for the logging policy and the creation of the bandit dataset).
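On the experimental procedure quoted in the Research Type row: the supervised-to-bandit simulation and the exponential (softmax) policy model are standard in the CRM literature. The sketch below is a minimal illustration of that setup, not the authors' code; the joint feature map φ(x, y), the 0/1 loss convention and the logging-policy parameters are assumptions made for the example.

```python
import numpy as np

def softmax_policy(theta, phi_x):
    """Exponential model pi_theta(y|x) proportional to exp(theta^T phi(x, y)).

    phi_x: array of shape (n_labels, d) holding the joint feature vector
    phi(x, y) for every candidate label y of context x.
    """
    scores = phi_x @ theta
    scores -= scores.max()          # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def supervised_to_bandit(features, true_label_sets, theta0, rng):
    """Build simulated bandit feedback from a multi-label dataset.

    For each context, a logging policy pi_0 (here a softmax with parameters
    theta0) samples one label; we record the incurred loss (1 if the sampled
    label is not among the true labels, 0 otherwise) and its propensity.
    """
    logs = []
    for phi_x, labels in zip(features, true_label_sets):
        p = softmax_policy(theta0, phi_x)
        y = rng.choice(len(p), p=p)
        loss = 0.0 if y in labels else 1.0
        logs.append((phi_x, y, loss, p[y]))  # (features, action, loss, propensity)
    return logs

# Example usage with toy data (2 contexts, 3 candidate labels, 4 features):
rng = np.random.default_rng(0)
features = [rng.normal(size=(3, 4)) for _ in range(2)]
true_label_sets = [{0}, {1, 2}]
theta0 = rng.normal(size=4)
bandit_logs = supervised_to_bandit(features, true_label_sets, theta0, rng)
```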
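On the Open Datasets row: the four datasets are distributed in LibSVM (svmlight) format. A minimal loading sketch with scikit-learn follows; the file path is a placeholder, not a path taken from the paper.

```python
from sklearn.datasets import load_svmlight_file

# Load a multi-label dataset stored in LibSVM/svmlight format.
# "scene_train.svm" is a placeholder file name; Scene, Yeast, RCV1-Topics
# and TMC2009 are all available in this format.
X, y = load_svmlight_file("scene_train.svm", multilabel=True)

print(X.shape)   # sparse feature matrix of shape (n_samples, n_features)
print(y[:3])     # tuples of true label indices for the first three samples
```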
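On the Dataset Splits row: a minimal sketch of the 75%/25% split of the training portion into D_train and D_valid, assuming scikit-learn and the placeholder file from the previous sketch; D_test is the held-out test portion of the original supervised dataset.

```python
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split

seed = 0  # one of the 20 random seeds used across repeated runs
X, y = load_svmlight_file("scene_train.svm", multilabel=True)  # placeholder path

# 75% of the supervised training data goes to D_train, 25% to D_valid.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=seed
)
```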
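On the Experiment Setup row: the clipping constant M is a simple percentile ratio of the logged propensities. A sketch, assuming the propensities π_0(y|x) observed in the logs H are collected in an array:

```python
import numpy as np

def clipping_constant(logged_propensities):
    """M = ratio of the 90th to the 10th percentile of the propensity scores
    observed in the logs H, as in (Swaminathan and Joachims 2015a)."""
    p = np.asarray(logged_propensities, dtype=float)
    return np.percentile(p, 90) / np.percentile(p, 10)

# Example: propensities recorded by the logging policy.
M = clipping_constant([0.02, 0.10, 0.25, 0.40, 0.05, 0.60])
```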