reproducibilityindex.ai

Uncovering Latent Biases in Text: Method and Application to Peer Review

Authors: Emaad Manzoor, Nihar B. Shah
Pages: 4767-4775

AAAI 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply our framework to quantify biases in the text of peer reviews from a reputed machine-learning conference before and after the conference adopted a double-blind reviewing policy. We show evidence of biases in the review ratings that serves as ground truth, and show that our proposed framework accurately detects the presence (and absence) of these biases from the review text without having access to the review ratings.
Researcher Affiliation	Academia	Emaad Manzoor, Nihar B. Shah Carnegie Mellon University emaad@cmu.edu, nihars@cs.cmu.edu
Pseudocode	No	The paper describes its methodology in prose and mathematical equations but does not include any pseudocode or algorithm blocks.
Open Source Code	Yes	Reproducibility: We make our code and data publicly available at http://emaadmanzoor.com/biases-in-text/.
Open Datasets	Yes	We assemble a dataset of 16,880 peer reviews from the OpenReview platform for all the 5,638 papers submitted to the International Conference on Learning Representations (ICLR) from 2017 to 2020. [...] Reproducibility: We make our code and data publicly available at http://emaadmanzoor.com/biases-in-text/.
Dataset Splits	Yes	We estimate the value of perf(f; t) and perf(g; t) using k-fold cross-validation. To eliminate any dependence on the choice of cross-validation folds, we repeat the bias estimation procedure many times with the data belonging to each fold randomized uniformly in each iteration. [...] We report results with multinomial Naive Bayes classifiers for f(·) and g(·) and the AUC as our chosen measure of classification performance. We estimate the value of perf(f; t) and perf(g; t) using 10-fold cross-validation.
Hardware Specification	No	The paper does not specify any hardware used for running the experiments, such as CPU or GPU models, or cloud computing resources.
Software Dependencies	No	The paper mentions using 'multinomial Naive Bayes classifiers' and 'gender package' (from a GitHub link), but does not provide specific version numbers for any software dependencies like Python, scikit-learn, or other libraries.
Experiment Setup	Yes	We use multinomial Naive Bayes classifiers with add-one smoothing for f(·) and g(·) on frequencies of unigrams and bigrams in the review and abstract text respectively. We use the area under the ROC curve (AUC) for both perf(f; t) and perf(g; t), estimated using 10-fold cross-validation. We downsample the reviews and abstracts in year t_DB to equalize the sample sizes and subgroup proportions in t_SB and t_DB, as described in Section . We repeat the bias estimation procedure 1,000 times with downsampling and the cross-validation folds randomized uniformly in each iteration.
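
The Experiment Setup row describes a concrete estimation loop, so a minimal sketch may help readers picture it. The Python below is not the authors' code (that is at the URL in the Open Source Code row) and uses only the standard library. It illustrates one plausible way to estimate a single perf(·; t) value: a two-class multinomial Naive Bayes classifier with add-one smoothing on unigram and bigram counts, scored by AUC over 10 folds, with the fold assignment reshuffled on every repeat. The function names, the whitespace tokenizer, the smoothed class prior, and pooling out-of-fold scores before computing AUC are all assumptions made for illustration. The downsampling step is omitted.

```python
import math
import random
from collections import Counter


def ngrams(text):
    """Return lowercased unigrams plus bigrams from whitespace tokenization."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def train_nb(docs, labels):
    """Collect per-class n-gram counts for a two-class multinomial Naive Bayes."""
    counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(ngrams(doc))
    return {
        "counts": counts,
        "totals": {y: sum(c.values()) for y, c in counts.items()},
        "n_docs": Counter(labels),
        "vocab": set(counts[0]) | set(counts[1]),
    }


def score_nb(model, doc):
    """Log-odds of class 1 over class 0, with add-one smoothing on n-gram counts."""
    v = len(model["vocab"])
    # Assumption: the class prior is smoothed too, so a class missing from a
    # training fold does not produce log(0).
    score = math.log((model["n_docs"][1] + 1) / (model["n_docs"][0] + 1))
    for tok in ngrams(doc):
        if tok not in model["vocab"]:
            continue  # n-grams unseen in training carry no evidence
        p1 = (model["counts"][1][tok] + 1) / (model["totals"][1] + v)
        p0 = (model["counts"][0][tok] + 1) / (model["totals"][0] + v)
        score += math.log(p1 / p0)
    return score


def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cv_auc(docs, labels, k=10, rng=random):
    """AUC of pooled out-of-fold scores from one randomly shuffled k-fold split."""
    idx = list(range(len(docs)))
    rng.shuffle(idx)
    scores = [0.0] * len(docs)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        for i in test:
            scores[i] = score_nb(model, docs[i])
    return auc(scores, labels)


def repeated_cv_auc(docs, labels, repeats=1000, k=10, seed=0):
    """Average cv_auc over many repeats, reshuffling the folds each time."""
    rng = random.Random(seed)
    runs = [cv_auc(docs, labels, k, rng) for _ in range(repeats)]
    return sum(runs) / len(runs)
```

Under these assumptions, calling repeated_cv_auc(review_texts, subgroup_labels) would give one performance estimate for a classifier on review text, and the same call on abstract texts would give the counterpart for the abstract classifier. Labels are expected to be the integers 0 and 1. How the two estimates are combined into a bias measure is specified in the paper and is not reproduced here.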