Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Can We Automate Scientific Reviewing?

Authors: Weizhe Yuan, Pengfei Liu, Graham Neubig

JAIR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experimental results on the test set show that while system-generated reviews are comprehensive, touching upon more aspects of the paper than human-written reviews, the generated texts are less constructive and less factual than human-written reviews for all aspects except the explanation of the papers' core ideas, which are largely factually correct. In this section, we investigate, using our proposed review generation systems with state-of-the-art pre-trained models, to what extent we can realize the desiderata of reviews defined in Section 2.2. We approach this goal through two concrete questions: (1) What are review generation systems (not) good at? (2) Will systems generate biased reviews? Automatic evaluation metrics include Aspect Coverage (ACov), Aspect Recall (ARec), and Semantic Equivalence (ROUGE, BERTScore). ... The results are shown in Tab. 5.
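The excerpt only names the evaluation metrics. As a minimal sketch, assuming Aspect Coverage (ACov) means the fraction of a predefined aspect set that a review touches on at least once (the aspect list and the matching rule below are hypothetical, not the paper's exact definition):

```python
# Hypothetical sketch of Aspect Coverage (ACov): the fraction of a predefined
# aspect set that a single review mentions at least once. The aspect names and
# the "already tagged" input format are assumptions for illustration only.
ASPECTS = ["motivation", "originality", "soundness", "clarity",
           "substance", "replicability", "meaningful_comparison"]

def aspect_coverage(review_aspects):
    """Fraction of ASPECTS present in the set of aspects tagged in a review."""
    return len(set(review_aspects) & set(ASPECTS)) / len(ASPECTS)
```

Aspect Recall (ARec) would then compare the generated review's aspects against those of the reference review rather than against the full aspect list.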
Researcher Affiliation | Academia | Weizhe Yuan, New York University, New York, NY 10003; Pengfei Liu, Carnegie Mellon University, Pittsburgh, PA 15213; Graham Neubig, Carnegie Mellon University, Pittsburgh, PA 15213
Pseudocode | Yes | We use two steps to extract salient sentences from a source document: (i) keyword filtering, (ii) the cross-entropy method. ... The algorithm is shown below. 1. For each paper containing n sentences, we first assume that each sentence is equally likely to be selected. We start with p_0 = (1/2, 1/2, ..., 1/2). Let t := 1. ... 5. If the value of γ_t has not changed for 3 iterations, stop. Otherwise, set t := t + 1 and return to step 2.
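The numbered steps above can be sketched as follows. This is a hedged reconstruction: the sample count, elite fraction, and the scoring function passed in are assumptions, not the paper's actual objective; only the overall loop (sample sentence subsets from per-sentence inclusion probabilities, keep the elite samples, update the probabilities, stop once the elite threshold γ_t is unchanged for 3 iterations) follows the quoted algorithm.

```python
import random

# Hedged sketch of the cross-entropy method for salient-sentence selection.
# `score` maps a boolean inclusion mask to a real-valued quality score; the
# paper's actual scoring criterion is not given in this excerpt.
def cross_entropy_select(sentences, score, n_samples=100, elite_frac=0.1,
                         patience=3, seed=0):
    rng = random.Random(seed)
    n = len(sentences)
    p = [0.5] * n                      # step 1: uniform inclusion probabilities
    last_gamma, stall = None, 0
    while True:
        # step 2: sample candidate subsets according to p
        samples = [[rng.random() < p[i] for i in range(n)]
                   for _ in range(n_samples)]
        elite = sorted(samples, key=score, reverse=True)
        elite = elite[:max(1, int(elite_frac * n_samples))]
        gamma = score(elite[-1])       # elite score threshold (γ_t)
        # steps 3-4: move p toward the elite samples' inclusion frequencies
        for i in range(n):
            p[i] = sum(s[i] for s in elite) / len(elite)
        # step 5: stop once γ_t is unchanged for `patience` iterations
        stall = stall + 1 if gamma == last_gamma else 0
        last_gamma = gamma
        if stall >= patience:
            return [sent for sent, keep in zip(sentences, p) if keep > 0.5]
```

With a toy scoring function that rewards keyword-bearing sentences and penalizes subset size, the loop converges to the keyword-bearing subset.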
Open Source Code | Yes | We make relevant resources publicly available for use by future research: https://github.com/neulab/ReviewAdvisor.
Open Datasets | Yes | We make relevant resources publicly available for use by future research: https://github.com/neulab/ReviewAdvisor. We crawled ICLR papers from 2017-2020 through OpenReview and NeurIPS papers from 2016-2019 through the NeurIPS Proceedings. For each paper's reviews, we keep as much metadata information as possible. ... Therefore we decided to collect our own dataset, Aspect-enhanced Peer Review (ASAP-Review).
Dataset Splits | Yes | This results in 8,742 unique papers and 25,986 paper-review pairs in total; the split of our dataset is shown in Tab. 4. Table 4 (data split of ASAP-Review): Train: 6,993 unique papers, 20,757 paper-review pairs; Validation: 874 unique papers, 2,571 paper-review pairs; Test: 875 unique papers, 2,658 paper-review pairs.
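The quoted split sizes are internally consistent: the three partitions sum to the stated totals, and the split is roughly 80/10/10 by unique papers. A small sanity-check sketch:

```python
# Sanity check of the ASAP-Review split sizes quoted above.
splits = {
    "train":      {"papers": 6993, "pairs": 20757},
    "validation": {"papers": 874,  "pairs": 2571},
    "test":       {"papers": 875,  "pairs": 2658},
}
total_papers = sum(s["papers"] for s in splits.values())  # stated total: 8,742
total_pairs = sum(s["pairs"] for s in splits.values())    # stated total: 25,986
train_frac = splits["train"]["papers"] / total_papers     # roughly 0.80
```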
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions software such as BART (Lewis et al., 2019) and BERT (Devlin et al., 2019) and specific checkpoints (bart-large-cnn, bert-large-cased), but does not provide version numbers for these or for other key software components (e.g., Python, PyTorch/TensorFlow) needed to replicate the experiments.
Experiment Setup | Yes | For all models, we initialized the model weights using the checkpoint bart-large-cnn... We use the Adam optimizer (Kingma & Ba, 2014) with a linear learning rate scheduler that increases the learning rate linearly from 0 to 4e-5 over the first 10% of steps (the warmup period) and then decreases it linearly to 0 over the remaining training steps. We fine-tuned our models on the whole dataset for 5 epochs. ... During generation, we used beam search decoding with beam size 4. As at training time, we set a minimum length of 100 and a maximum length of 1024. A length penalty of 2.0 and trigram blocking (Paulus et al., 2017) were used as well. We used the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 5e-5 to fine-tune our model. We trained for 5 epochs and saved the model that achieved the lowest loss on the validation set as our aspect tagger.
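The warmup-then-decay schedule described above can be written as a simple piecewise-linear function. A minimal sketch, assuming the paper's 4e-5 peak and 10% warmup fraction; the total step count is illustrative, not a value the paper reports:

```python
# Linear warmup from 0 to peak_lr over the first warmup_frac of steps,
# then linear decay back to 0 at total_steps (the schedule quoted above).
def lr_at(step, total_steps, peak_lr=4e-5, warmup_frac=0.1):
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # decay linearly from peak_lr at the end of warmup to 0 at total_steps
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```

For example, with 1,000 total steps the learning rate peaks at step 100 and reaches 0 again at step 1,000.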