Counterfactual Evaluation of Peer-Review Assignment Policies

Authors: Martin Saveski, Steven Jecmen, Nihar Shah, Johan Ugander

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the reviewer's past papers and the submission), and (ii) the cost of randomization, capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. (An estimator sketch for this counterfactual setup appears after the table.)
Researcher Affiliation | Academia | Martin Saveski, University of Washington (msaveski@uw.edu); Steven Jecmen, Carnegie Mellon University (sjecmen@cs.cmu.edu); Nihar B. Shah, Carnegie Mellon University (nihars@cs.cmu.edu); Johan Ugander, Stanford University (jugander@stanford.edu)
Pseudocode | No | The paper provides mathematical formulations of the linear programs in Appendix A but does not include structured pseudocode or clearly labeled algorithm blocks. (An LP sketch appears after the table.)
Open Source Code | Yes | Our code is available at: https://github.com/msaveski/counterfactual-peer-review.
Open Datasets | No | The paper uses data from 'the TPDP'21 workshop' and 'the AAAI'22 conference' but does not provide concrete access information (link, DOI, repository, or formal citation for public access) for these datasets.
Dataset Splits | Yes | To evaluate the performance of the models, we randomly split the observed reviewer-paper pairs into train (75%) and test (25%) sets, fit the models on the train set, and measure the mean absolute error (MAE) of the predictions on the test set. To get more robust estimates of the performance, we repeat this process 10 times. In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as a selection criterion, and retrain the model on the full training set with the best hyperparameters. (A protocol sketch appears after the table.)
Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments.
Software Dependencies | No | Appendix E mentions specific models and methods (e.g., 'logistic regression', 'ridge classification', 'SVD++ collaborative filtering') but does not provide specific version numbers for the software or libraries used to implement them.
Experiment Setup | Yes | In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as a selection criterion, and retrain the model on the full training set with the best hyperparameters. We also discuss choices for assignment parameters such as w_text, λ_bid, and the randomized assignment parameter q.
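
The "Research Type" row summarizes the paper's counterfactual setup: review quality under an alternative assignment policy is estimated from reviews collected under a randomized, deployed assignment. Below is a minimal sketch of one standard estimator for this kind of off-policy evaluation, inverse-propensity (Horvitz-Thompson) weighting; the variable names (quality, assigned, p_old, p_new) are illustrative and are not taken from the paper's released code.

```python
import numpy as np

def ht_estimate(quality, assigned, p_old, p_new, eps=1e-12):
    """
    Horvitz-Thompson-style estimate of mean review quality under a
    counterfactual assignment policy.

    quality  : (n_papers, n_reviewers) observed review-quality scores
               (only entries with assigned == True are used)
    assigned : (n_papers, n_reviewers) boolean matrix of realized assignments
    p_old    : (n_papers, n_reviewers) assignment probabilities of the
               deployed (randomized) policy
    p_new    : (n_papers, n_reviewers) assignment probabilities of the
               counterfactual policy being evaluated
    """
    # Importance weights: how much the new policy up/down-weights each
    # observed reviewer-paper pair relative to the deployed policy.
    w = np.where(assigned, p_new / np.maximum(p_old, eps), 0.0)
    # Weighted sum of observed qualities, normalized by the expected
    # number of assignments under the counterfactual policy.
    return np.sum(w * np.nan_to_num(quality)) / np.sum(p_new)

# Tiny synthetic instance (2 papers x 3 reviewers, one review each).
quality  = np.array([[4.0, np.nan, np.nan], [np.nan, 3.0, np.nan]])
assigned = ~np.isnan(quality)
p_old    = np.full((2, 3), 1 / 3)          # uniform randomized policy
p_new    = np.array([[0.5, 0.25, 0.25],    # counterfactual policy
                     [0.25, 0.5, 0.25]])
print(ht_estimate(quality, assigned, p_old, p_new))
```

Note that pairs the deployed policy could never assign (p_old = 0) cannot be reweighted this way and require either restricting the counterfactual policy or imputing their quality with a prediction model.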
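
The "Pseudocode" row notes that the assignment is specified via linear programs in Appendix A. The sketch below shows the standard LP for randomized reviewer-paper assignment with a per-pair probability cap q (the randomization parameter mentioned in the "Experiment Setup" row), solved with scipy.optimize.linprog. The loads k and ell, the weight w_text, and the convex combination of text similarity and bids used to build the similarity matrix are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_papers, n_reviewers = 4, 6
k = 2          # reviewers required per paper (assumed load)
ell = 2        # maximum papers per reviewer (assumed load)
q = 0.5        # cap on any single assignment probability
w_text = 0.75  # assumed weight on text similarity vs. bids

# Illustrative similarity: convex combination of text similarity and bids.
text_sim = rng.random((n_papers, n_reviewers))
bids = rng.random((n_papers, n_reviewers))
S = w_text * text_sim + (1 - w_text) * bids

# Decision variables x[p, r] = marginal probability of assigning reviewer r
# to paper p; linprog minimizes, so negate to maximize total similarity.
c = -S.flatten()

# Each paper receives exactly k reviewers in expectation.
A_eq = np.zeros((n_papers, n_papers * n_reviewers))
for p in range(n_papers):
    A_eq[p, p * n_reviewers:(p + 1) * n_reviewers] = 1.0
b_eq = np.full(n_papers, k)

# Each reviewer handles at most ell papers in expectation.
A_ub = np.zeros((n_reviewers, n_papers * n_reviewers))
for r in range(n_reviewers):
    A_ub[r, r::n_reviewers] = 1.0
b_ub = np.full(n_reviewers, ell)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, q), method="highs")
probs = res.x.reshape(n_papers, n_reviewers)
print(np.round(probs, 2))
```

The solved marginal probabilities are then sampled into a concrete assignment whose per-pair probabilities match the LP solution, which is what produces the randomization that the counterfactual estimator above relies on.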
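
The "Dataset Splits" row describes the model-evaluation protocol precisely: 75/25 train/test splits repeated 10 times, with 10-fold cross-validation on the training set to select hyperparameters by MAE before refitting on the full training set. Below is a minimal scikit-learn sketch of that protocol; the Ridge model, its hyperparameter grid, and the synthetic data are stand-ins, not the specific models from Appendix E.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error

# Placeholder features/targets standing in for reviewer-paper pair
# covariates and observed review-quality scores.
rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(500)

test_maes = []
for repeat in range(10):  # 10 random 75/25 splits for robustness
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=repeat)

    # 10-fold CV on the training set, selecting hyperparameters by MAE;
    # GridSearchCV refits on the full training set with the best ones.
    search = GridSearchCV(
        Ridge(),
        param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
        scoring="neg_mean_absolute_error",
        cv=10,
    )
    search.fit(X_tr, y_tr)

    test_maes.append(mean_absolute_error(y_te, search.predict(X_te)))

print(f"MAE over 10 splits: {np.mean(test_maes):.3f} +/- {np.std(test_maes):.3f}")
```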