Counterfactual Evaluation of Peer-Review Assignment Policies
Authors: Martin Saveski, Steven Jecmen, Nihar Shah, Johan Ugander
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the reviewer's past papers and the submission), and (ii) the cost of randomization, capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. |
| Researcher Affiliation | Academia | Martin Saveski (University of Washington, msaveski@uw.edu); Steven Jecmen (Carnegie Mellon University, sjecmen@cs.cmu.edu); Nihar B. Shah (Carnegie Mellon University, nihars@cs.cmu.edu); Johan Ugander (Stanford University, jugander@stanford.edu) |
| Pseudocode | No | The paper provides mathematical formulations for linear programs in Appendix A but does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/msaveski/counterfactual-peer-review. |
| Open Datasets | No | The paper uses data from the TPDP'21 workshop and the AAAI'22 conference but does not provide concrete access information (link, DOI, repository, or formal citation for public access) for these datasets. |
| Dataset Splits | Yes | To evaluate the performance of the models, we randomly split the observed reviewer-paper pairs into train (75%) and test (25%) sets, fit the models on the train set, and measure the mean absolute error (MAE) of the predictions on the test set. To get more robust estimates of the performance, we repeat this process 10 times. In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as a selection criterion, and retrain the model on the full training set with the best hyperparameters. (A code sketch of this evaluation protocol appears after the table.) |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | Appendix E mentions specific models and methods (e.g., 'logistic regression', 'ridge classification', 'SVD++ collaborative filtering') but does not provide specific version numbers for the software or libraries used to implement them. |
| Experiment Setup | Yes | In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as a selection criterion, and retrain the model on the full training set with the best hyperparameters. We also discuss choices for parameters such as w_text, λ_bid, and the randomized assignment parameter q (a sketch of the randomization constraint appears after the table). |
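
The evaluation protocol quoted under "Dataset Splits" can be summarized in code. The sketch below is not the authors' implementation: the ridge model and hyperparameter grid are placeholders chosen for illustration, while the 75/25 split, the 10 repetitions, the 10-fold cross-validation, and the MAE criterion come from the paper.

```python
# Sketch of the repeated train/test evaluation described under "Dataset Splits".
# Model and parameter grid are hypothetical stand-ins for the paper's predictors.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

def evaluate(X, y, n_repeats=10, seed=0):
    maes = []
    for rep in range(n_repeats):
        # 75% train / 25% test split of the observed reviewer-paper pairs.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed + rep)
        # 10-fold CV on the train set, selecting hyperparameters by MAE,
        # then refitting on the full training set with the best setting.
        search = GridSearchCV(
            Ridge(),
            param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # placeholder grid
            scoring="neg_mean_absolute_error",
            cv=10,
            refit=True)
        search.fit(X_tr, y_tr)
        # Test-set MAE for this repetition.
        maes.append(mean_absolute_error(y_te, search.predict(X_te)))
    return float(np.mean(maes)), float(np.std(maes))
```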
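
The randomized assignment parameter q referenced under "Experiment Setup" caps the probability that any given reviewer-paper pair is assigned. The sketch below is a minimal illustration of such a probability-limited linear program (maximize expected similarity subject to the cap), not the authors' code; the load constraints, variable names, and the use of cvxpy are assumptions made for this example.

```python
# Minimal sketch of a probability-limited reviewer-paper assignment LP:
# maximize expected similarity while capping each marginal probability at q.
import cvxpy as cp

def randomized_assignment_marginals(S, q, paper_load=3, reviewer_load=6):
    """S: (n_reviewers, n_papers) similarity matrix; returns marginal probabilities."""
    n_rev, n_pap = S.shape
    P = cp.Variable((n_rev, n_pap))
    constraints = [
        P >= 0,
        P <= q,                              # randomization cap q
        cp.sum(P, axis=0) == paper_load,     # each paper receives paper_load reviews
        cp.sum(P, axis=1) <= reviewer_load,  # reviewer workload limit
    ]
    problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(S, P))), constraints)
    problem.solve()
    return P.value
```

The separate step of sampling a concrete assignment whose pairwise marginals match P is omitted here.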