Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Uncovering Latent Biases in Text: Method and Application to Peer Review
Authors: Emaad Manzoor, Nihar B. Shah | Pages: 4767-4775
AAAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our framework to quantify biases in the text of peer reviews from a reputed machine-learning conference before and after the conference adopted a double-blind reviewing policy. We show evidence of biases in the review ratings that serves as ground truth, and show that our proposed framework accurately detects the presence (and absence) of these biases from the review text without having access to the review ratings. |
| Researcher Affiliation | Academia | Emaad Manzoor, Nihar B. Shah Carnegie Mellon University EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methodology in prose and mathematical equations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Reproducibility: We make our code and data publicly available at http://emaadmanzoor.com/biases-in-text/. |
| Open Datasets | Yes | We assemble a dataset of 16,880 peer reviews from the OpenReview platform for all the 5,638 papers submitted to the International Conference on Learning Representations (ICLR) from 2017 to 2020. [...] Reproducibility: We make our code and data publicly available at http://emaadmanzoor.com/biases-in-text/. |
| Dataset Splits | Yes | We estimate the value of perf(f; t) and perf(g; t) using k-fold cross-validation. To eliminate any dependence on the choice of cross-validation folds, we repeat the bias estimation procedure many times with the data belonging to each fold randomized uniformly in each iteration. [...] We report results with multinomial Naive Bayes classifiers for f(·) and g(·) and the AUC as our chosen measure of classification performance. We estimate the value of perf(f; t) and perf(g; t) using 10-fold cross-validation. |
| Hardware Specification | No | The paper does not specify any hardware used for running the experiments, such as CPU or GPU models, or cloud computing resources. |
| Software Dependencies | No | The paper mentions using 'multinomial Naive Bayes classifiers' and 'gender package' (from a GitHub link), but does not provide specific version numbers for any software dependencies like Python, scikit-learn, or other libraries. |
| Experiment Setup | Yes | We use multinomial Naive Bayes classifiers with add-one smoothing for f(·) and g(·) on frequencies of unigrams and bigrams in the review and abstract text respectively. We use the area under the ROC curve (AUC) for both perf(f; t) and perf(g; t), estimated using 10-fold cross-validation. We downsample the reviews and abstracts in year t_DB to equalize the sample sizes and subgroup proportions in t_SB and t_DB, as described in Section . We repeat the bias estimation procedure 1,000 times with downsampling and the cross-validation folds randomized uniformly in each iteration. |
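
The experiment setup quoted above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' released code. It assumes whitespace tokenization, binary labels, and an AUC pooled over out-of-fold scores; the function names (`ngrams`, `train_nb`, `score_nb`, `auc`, `cv_auc`) and the toy documents are hypothetical, and the downsampling step is omitted.

```python
import math
import random
from collections import Counter

def ngrams(text):
    """Unigrams and bigrams from lowercased, whitespace-split text (assumed tokenizer)."""
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

def train_nb(docs, labels):
    """Collect per-class n-gram counts for a two-class multinomial Naive Bayes model.
    Both classes must occur in the training data."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(ngrams(doc))
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab

def score_nb(model, doc):
    """Log-odds of class 1 versus class 0, using add-one smoothing."""
    counts, n_docs, vocab = model
    totals = {y: sum(counts[y].values()) + len(vocab) for y in (0, 1)}
    score = math.log(n_docs[1]) - math.log(n_docs[0])
    for tok in ngrams(doc):
        if tok in vocab:  # tokens unseen in training are ignored
            score += math.log((counts[1][tok] + 1) / totals[1])
            score -= math.log((counts[0][tok] + 1) / totals[0])
    return score

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cv_auc(docs, labels, k=10, seed=0):
    """k-fold cross-validated AUC; fold membership is randomized by `seed`."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    scores = [0.0] * len(docs)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        for i in test:
            scores[i] = score_nb(model, docs[i])
    return auc(scores, labels)

# Toy data, invented for illustration only.
docs = ["the paper has strong novel results"] * 20 + \
       ["the paper has weak unclear writing"] * 20
labels = [1] * 20 + [0] * 20

# Repeat with re-randomized folds (the paper reports 1,000 repetitions; 5 here).
aucs = [cv_auc(docs, labels, k=10, seed=s) for s in range(5)]
print(sum(aucs) / len(aucs))
```

In this sketch each repetition only changes the fold assignment through `seed`. Averaging the per-repetition AUCs is one plausible way to summarize them; the paper's own aggregation may differ.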