Counterfactual Evaluation of Peer-Review Assignment Policies
Authors: Martin Saveski, Steven Jecmen, Nihar Shah, Johan Ugander
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our methods to peer-review data from two computer science venues: the TPDP'21 workshop (95 papers and 35 reviewers) and the AAAI'22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers' bids vs. textual similarity (between the reviewer's past papers and the submission), and (ii) the cost of randomization, capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. |
| Researcher Affiliation | Academia | Martin Saveski (University of Washington, msaveski@uw.edu); Steven Jecmen (Carnegie Mellon University, sjecmen@cs.cmu.edu); Nihar B. Shah (Carnegie Mellon University, nihars@cs.cmu.edu); Johan Ugander (Stanford University, jugander@stanford.edu) |
| Pseudocode | No | The paper provides mathematical formulations for linear programs in Appendix A but does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/msaveski/counterfactual-peer-review. |
| Open Datasets | No | The paper uses data from the TPDP'21 workshop and the AAAI'22 conference but does not provide concrete access information (link, DOI, repository, or formal citation for public access) for these datasets. |
| Dataset Splits | Yes | To evaluate the performance of the models, we randomly split the observed reviewer-paper pairs into train (75%) and test (25%) sets, fit the models on the train set, and measure the mean absolute error (MAE) of the predictions on the test set. To get more robust estimates of the performance, we repeat this process 10 times. In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as a selection criterion, and retrain the model on the full training set with the best hyperparameters. (A code sketch of this evaluation protocol appears after the table.) |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU/CPU models, memory, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | Appendix E mentions specific models and methods (e.g., 'logistic regression', 'ridge classification', 'SVD++ collaborative filtering') but does not provide specific version numbers for the software or libraries used to implement them. |
| Experiment Setup | Yes | In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as a selection criterion, and retrain the model on the full training set with the best hyperparameters. We also discuss choices for parameters such as w_text, λ_bid, and the randomized assignment parameter q (a sketch of the randomization constraint appears after the table). |
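
The evaluation protocol quoted under "Dataset Splits" can be summarized in code. The sketch below is not the authors' implementation: the ridge model and hyperparameter grid are placeholders chosen for illustration, while the 75/25 split, the 10 repetitions, the 10-fold cross-validation, and the MAE criterion come from the paper.

```python
# Sketch of the repeated train/test evaluation described under "Dataset Splits".
# Model and parameter grid are hypothetical stand-ins for the paper's predictors.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

def evaluate(X, y, n_repeats=10, seed=0):
    maes = []
    for rep in range(n_repeats):
        # 75% train / 25% test split of the observed reviewer-paper pairs.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed + rep)
        # 10-fold CV on the train set, selecting hyperparameters by MAE,
        # then refitting on the full training set with the best setting.
        search = GridSearchCV(
            Ridge(),
            param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # placeholder grid
            scoring="neg_mean_absolute_error",
            cv=10,
            refit=True)
        search.fit(X_tr, y_tr)
        # Test-set MAE for this repetition.
        maes.append(mean_absolute_error(y_te, search.predict(X_te)))
    return float(np.mean(maes)), float(np.std(maes))
```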
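
The randomized assignment parameter q referenced under "Experiment Setup" caps the probability that any given reviewer-paper pair is assigned. The sketch below is a minimal illustration of such a probability-limited linear program (maximize expected similarity subject to the cap), not the authors' code; the load constraints, variable names, and the use of cvxpy are assumptions made for this example.

```python
# Minimal sketch of a probability-limited reviewer-paper assignment LP:
# maximize expected similarity while capping each marginal probability at q.
import cvxpy as cp

def randomized_assignment_marginals(S, q, paper_load=3, reviewer_load=6):
    """S: (n_reviewers, n_papers) similarity matrix; returns marginal probabilities."""
    n_rev, n_pap = S.shape
    P = cp.Variable((n_rev, n_pap))
    constraints = [
        P >= 0,
        P <= q,                              # randomization cap q
        cp.sum(P, axis=0) == paper_load,     # each paper receives paper_load reviews
        cp.sum(P, axis=1) <= reviewer_load,  # reviewer workload limit
    ]
    problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(S, P))), constraints)
    problem.solve()
    return P.value
```

The separate step of sampling a concrete assignment whose pairwise marginals match P is omitted here.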