Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
PeerReview4All: Fair and Accurate Reviewer Assignment in Peer Review
Authors: Ivan Stelmakh, Nihar Shah, Aarti Singh
JMLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our fifth and final contribution comprises empirical evaluations. We designed and conducted an experiment on the Amazon Mechanical Turk crowdsourcing platform to objectively compare the performance of different reviewer-assignment algorithms. The design of the experiment is done carefully to circumvent the challenge posed by the absence of a ground truth in peer review settings, so that we can evaluate accuracy objectively. In addition to the MTurk experiment, we provide an extensive evaluation of our algorithm on synthetic data, provide an evaluation on a reconstructed similarity matrix from the ICLR 2018 conference, and report the results of the experiment on real conference data conducted by Kobren et al. (2019). |
| Researcher Affiliation | Academia | Ivan Stelmakh EMAIL Nihar Shah EMAIL Aarti Singh EMAIL School of Computer Science Carnegie Mellon University 5000 Forbes Ave, Pittsburgh, PA 15213 |
| Pseudocode | Yes | Algorithm 1 PeerReview4All Algorithm. Input: λ ∈ [n]: number of reviewers required per paper; S ∈ [0, 1]^{n×m}: similarity matrix; µ ∈ [m]: reviewers' maximum load; f: transformation of similarities. Output: Reviewer assignment A^{PR4A}_f |
| Open Source Code | Yes | The data set pertaining to the MTurk experiment, as well as the code for our PeerReview4All algorithm, are available on the first author's website. |
| Open Datasets | Yes | The data set pertaining to the MTurk experiment, as well as the code for our PeerReview4All algorithm, are available on the first author's website. |
| Dataset Splits | Yes | In each of the 6 regions, we first split the 10 questions into two sets: a gold standard set of 8 questions chosen uniformly at random and an unresolved set comprising the 2 remaining questions. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments. |
| Software Dependencies | No | The paper mentions tools such as the Toronto Paper Matching System (TPMS) and its open-source code for constructing a similarity matrix, but it does not specify version numbers for any software dependencies used in the experimental setup or for the PeerReview4All algorithm itself. |
| Experiment Setup | Yes | We consider the instance of the reviewer assignment problem with m = n = 100 and λ = µ = 4. [...] In each of these assignments, every question was answered by λ = 3 workers and every worker answered at most µ = 2 questions. |
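To make the (λ, µ, S) setting from the pseudocode and experiment-setup rows concrete, here is a minimal sketch of the reviewer-assignment problem in the paper's synthetic configuration (m = n = 100, λ = µ = 4). This is not the PeerReview4All algorithm itself (which optimizes fairness for the worst-off paper using max-flow subroutines); it is a hypothetical load-balanced greedy baseline, shown only to illustrate the input/output shape. All function and variable names are our own.

```python
import random

def balanced_greedy_assign(S, lam, mu):
    """Assign lam reviewers to each paper, at most mu papers per reviewer.

    Among the least-loaded reviewers, prefer higher similarity. Keeping
    loads balanced guarantees feasibility whenever m * mu >= n * lam.
    """
    n, m = len(S), len(S[0])
    load = [0] * m                      # papers currently assigned per reviewer
    assignment = []
    for i in range(n):
        # Least-loaded reviewers first; break ties by similarity to paper i.
        ranked = sorted(range(m), key=lambda j: (load[j], -S[i][j]))
        chosen = [j for j in ranked if load[j] < mu][:lam]
        if len(chosen) < lam:
            raise ValueError(f"paper {i}: insufficient reviewer capacity")
        for j in chosen:
            load[j] += 1
        assignment.append(chosen)
    return assignment

# Mirror the synthetic setting quoted above: m = n = 100, lambda = mu = 4.
rng = random.Random(0)
n = m = 100
S = [[rng.random() for _ in range(m)] for _ in range(n)]
A = balanced_greedy_assign(S, lam=4, mu=4)
```

With demand exactly matching capacity (100 × 4 on both sides), the balanced greedy fills every reviewer to load 4; a similarity-only greedy could instead strand the last papers without eligible reviewers.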
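The dataset-splits row describes a per-region split of 10 questions into a gold-standard set of 8 (chosen uniformly at random) and 2 unresolved questions. A small sketch of that split, with hypothetical question identifiers (the region count of 6 and set sizes come from the quoted text; everything else here is illustrative):

```python
import random

def split_region(questions, gold_size=8, rng=random):
    """Uniformly sample gold_size gold-standard questions; rest are unresolved."""
    gold = rng.sample(questions, gold_size)
    unresolved = [q for q in questions if q not in gold]
    return gold, unresolved

rng = random.Random(42)
# 6 regions, 10 questions each, as in the quoted experimental design.
regions = {r: [f"q{r}_{k}" for k in range(10)] for r in range(6)}
splits = {r: split_region(qs, rng=rng) for r, qs in regions.items()}
```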