Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Individual Arbitrariness and Group Fairness
Authors: Carol Long, Hsiang Hsu, Wael Alghamdi, Flavio Calmon
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results to show that arbitrariness is masked by favorable group-fairness and accuracy metrics for multiple fairness intervention methods, baseline models, and datasets. We also demonstrate the effectiveness of the ensemble in reducing the predictive multiplicity of fair models. |
| Researcher Affiliation | Academia | John A. Paulson School of Engineering and Applied Sciences, Harvard University, Boston, MA 02134. Emails: EMAIL, EMAIL, EMAIL. |
| Pseudocode | No | The paper describes methods in paragraph text and does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found at https://github.com/Carol-Long/Fairness_and_Arbitrariness |
| Open Datasets | Yes | We report predictive multiplicity and benchmark the ensemble method on three datasets: two datasets in the education domain, the high-school longitudinal study (HSLS) dataset [27, 28] and the ENEM dataset [16] (see Alghamdi et al. [2] Appendix B.1), and the UCI Adult dataset [33], which is based on the US census income data. |
| Dataset Splits | Yes | First, split the data into training, validation, and test dataset. ... We use the validation set to measure $\epsilon$ corresponding to this empirical Rashomon Set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like Scikit-learn, AIF360 toolkits, and PANDAS package, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | For logistic regression and gradient boosting, the default hyperparameter is used; for random forest, we set the number of trees and minimum number of samples per leaf to 10 to prevent over-fitting. To get 10 competing models for each hypothesis class, we use 10 random seeds (specifically 33-42). |
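
The Experiment Setup row can be made concrete with a minimal scikit-learn sketch. It is not taken from the authors' repository. It assumes that the seeds in the quoted setup run from 33 to 42 inclusive (ten values) and that each seed is passed to the estimator as `random_state`. The helper name `build_competing_models` and the dictionary keys are hypothetical.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Assumption: the ten seeds in the quoted setup are 33, 34, ..., 42.
SEEDS = list(range(33, 43))


def build_competing_models(seeds=SEEDS):
    """Return {hypothesis class name: [one unfitted model per seed]}.

    Hypothetical helper: logistic regression and gradient boosting keep
    scikit-learn defaults; random forest uses 10 trees and at least
    10 samples per leaf, as quoted in the table.
    """
    return {
        "logistic_regression": [
            LogisticRegression(random_state=s) for s in seeds
        ],
        "gradient_boosting": [
            GradientBoostingClassifier(random_state=s) for s in seeds
        ],
        "random_forest": [
            RandomForestClassifier(
                n_estimators=10, min_samples_leaf=10, random_state=s
            )
            for s in seeds
        ],
    }
```

Calling `build_competing_models()` yields ten unfitted estimators per hypothesis class, each of which would then be fit on the training split. Note that in scikit-learn `random_state` affects `LogisticRegression` only for the `sag`, `saga`, and `liblinear` solvers, so with default settings the seed would have to vary something else (for example the data split) to produce distinct models; the excerpt above does not say how the paper handles this.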