Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review

Authors: Ivan Stelmakh, Nihar Shah, Aarti Singh

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our fifth and final contribution comprises empirical evaluations. We designed and conducted an experiment on the Amazon Mechanical Turk crowdsourcing platform to objectively compare the performance of different reviewer-assignment algorithms. The design of the experiment is done carefully to circumvent the challenge posed by the absence of a ground truth in peer review settings, so that we can evaluate accuracy objectively. In addition to the MTurk experiment, we provide an extensive evaluation of our algorithm on synthetic data, provide an evaluation on a reconstructed similarity matrix from the ICLR 2018 conference, and report the results of the experiment on real conference data conducted by Kobren et al. (2019)."
Researcher Affiliation | Academia | "Ivan Stelmakh (EMAIL), Nihar Shah (EMAIL), Aarti Singh (EMAIL); School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213"
Pseudocode | Yes | "Algorithm 1: PeerReview4All Algorithm. Input: λ ∈ [n], number of reviewers required per paper; S ∈ [0,1]^{n×m}, similarity matrix; µ ∈ [m], reviewers' maximum load; f, transformation of similarities. Output: reviewer assignment A^{PR4A}_f"
Open Source Code | Yes | "The data set pertaining to the MTurk experiment, as well as the code for our PeerReview4All algorithm, are available on the first author's website."
Open Datasets | Yes | "The data set pertaining to the MTurk experiment, as well as the code for our PeerReview4All algorithm, are available on the first author's website."
Dataset Splits | Yes | "In each of the 6 regions, we first split the 10 questions into two sets: a gold standard set of 8 questions chosen uniformly at random and an unresolved set comprising the 2 remaining questions."
Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments.
Software Dependencies | No | The paper mentions other tools like the Toronto Paper Matching System (TPMS) and its open-source code for constructing a similarity matrix, but it does not specify version numbers for any software dependencies used in their own experimental setup or for the PeerReview4All algorithm itself.
Experiment Setup | Yes | "We consider the instance of the reviewer assignment problem with m = n = 100 and λ = µ = 4. [...] In each of these assignments, every question was answered by λ = 3 workers and every worker answered at most µ = 2 questions."
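The dataset split quoted above (per region, a gold-standard set of 8 of the 10 questions drawn uniformly at random, with the remaining 2 left unresolved) can be sketched as follows. This is an illustrative reconstruction; the function and variable names are not taken from the paper's released code.

```python
import random

def split_region_questions(questions, gold_size=8, seed=None):
    """Split one region's questions into a gold-standard set (chosen
    uniformly at random) and an unresolved set of the remainder."""
    if seed is not None:
        random.seed(seed)
    gold = random.sample(questions, gold_size)
    unresolved = [q for q in questions if q not in gold]
    return gold, unresolved

# Example: one region with 10 questions, as in the MTurk experiment.
questions = [f"q{i}" for i in range(10)]
gold, unresolved = split_region_questions(questions, gold_size=8, seed=0)
assert len(gold) == 8 and len(unresolved) == 2
```

Repeating this split independently for each of the 6 regions reproduces the setup described in the quote.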
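The load constraints that appear throughout the table (λ reviewers per paper, at most µ papers per reviewer) can be checked mechanically. The sketch below is a hedged validity check for an assignment, not the PeerReview4All algorithm itself; the helper name and the toy round-robin instance are illustrative assumptions.

```python
def is_valid_assignment(assignment, n_papers, lam, mu):
    """Check that every paper receives exactly `lam` distinct reviewers
    and no reviewer is assigned more than `mu` papers.

    `assignment` maps paper index -> list of reviewer indices.
    """
    load = {}
    for paper in range(n_papers):
        reviewers = assignment.get(paper, [])
        if len(reviewers) != lam or len(set(reviewers)) != lam:
            return False  # wrong count or duplicate reviewer on a paper
        for r in reviewers:
            load[r] = load.get(r, 0) + 1
    return all(count <= mu for count in load.values())

# Toy instance mirroring the synthetic setup's shape (m = n, lam = mu):
# 4 papers, 4 reviewers, lam = mu = 2, assigned round-robin.
assignment = {p: [(p + k) % 4 for k in range(2)] for p in range(4)}
assert is_valid_assignment(assignment, n_papers=4, lam=2, mu=2)
```

With m = n and λ = µ, as in the paper's synthetic instance, a feasible assignment always exists; the check above only verifies feasibility of a given assignment, it does not construct one.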