OxonFair: A Flexible Toolkit for Algorithmic Fairness

Authors: Eoin Delaney, Zihao Fu, Sandra Wachter, Brent Mittelstadt, Chris Russell

NeurIPS 2024

Reproducibility assessment: each entry lists the variable, the assessed result, and the supporting LLM response (quoted where it excerpts the paper).
Research Type: Experimental
"We present OxonFair, a new open-source toolkit for enforcing fairness in binary classification. Compared to existing toolkits: (i) We support NLP and Computer Vision classification as well as standard tabular problems. (ii) We support enforcing fairness on validation data, making us robust to a wide range of overfitting challenges. (iii) Our approach can optimize any measure based on True Positives, False Positives, False Negatives, and True Negatives. (iv) We jointly optimize a performance objective alongside fairness constraints. This minimizes degradation while enforcing fairness, and even improves the performance of inadequately tuned unfair baselines. OxonFair is compatible with standard ML toolkits, including sklearn, AutoGluon, and PyTorch, and is available at https://github.com/oxfordinternetinstitute/oxonfair." ... "5 Experimental Analysis"
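Claim (iii) above means that any metric expressible in per-group confusion-matrix counts can serve as an objective or constraint. As a minimal illustration (standard textbook definitions, not the toolkit's internal code), demographic parity and equal opportunity differences can be written directly in terms of TP, FP, FN, and TN per group:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Per-group building blocks: (TP, FP, FN, TN) for binary labels."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    return tp, fp, fn, tn

def demographic_parity_diff(y_true, y_pred, groups):
    """Largest gap in positive-prediction rate between groups."""
    rates = []
    for g in np.unique(groups):
        tp, fp, fn, tn = confusion_counts(y_true[groups == g], y_pred[groups == g])
        rates.append((tp + fp) / (tp + fp + fn + tn))  # P(y_hat = 1 | group)
    return max(rates) - min(rates)

def equal_opportunity_diff(y_true, y_pred, groups):
    """Largest gap in recall (true-positive rate) between groups."""
    tprs = []
    for g in np.unique(groups):
        tp, _, fn, _ = confusion_counts(y_true[groups == g], y_pred[groups == g])
        tprs.append(tp / (tp + fn))
    return max(tprs) - min(tprs)
```

Any other measure built from the same four counts (accuracy, precision, F1, predictive parity, and so on) slots into the same pattern, which is what makes the constraint space described in (iii) so broad.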
Researcher Affiliation: Academia
"Eoin Delaney, University of Oxford; Zihao Fu, University of Oxford; Sandra Wachter, University of Oxford; Brent Mittelstadt, University of Oxford; Chris Russell, University of Oxford; firstname.lastname@oii.ox.ac.uk"
Pseudocode: No
The paper describes its algorithms and processes in textual form and through diagrams (e.g., Figure 2: "Left: Summary of the fast path algorithm for inferred attributes (Section 4.1)"), but it does not include any formal pseudocode or algorithm blocks.
Open Source Code: Yes
"OxonFair is compatible with standard ML toolkits, including sklearn, AutoGluon, and PyTorch, and is available at https://github.com/oxfordinternetinstitute/oxonfair."
Open Datasets: Yes
"We compare with all group fairness methods offered by AIF360, and the reductions approach of Fairlearn. OxonFair is compatible with any learner with an implementation of the method predict_proba consistent with scikit-learn [18], including AutoGluon [71] and XGBoost [19]. A comparison with Fairlearn and the group methods from AIF360 on the adult dataset can be seen in Figures 1 and 6 using random forests." ... "CelebA [79]: We use the standard aligned & cropped partitions frequently used in fairness evaluation [17, 21, 28, 41]." ... "We conducted experiments on hate speech detection and toxicity classification using two datasets: the multilingual Twitter corpus [84] and Jigsaw [85]."
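Because any learner exposing a scikit-learn-style predict_proba can be wrapped, usage reduces to fitting a base model and handing it to the toolkit. The sketch below follows the pattern shown in the repository README; the exact FairPredictor constructor signature, the validation-data layout, and the 0.02 constraint level are illustrative assumptions, not details confirmed by the paper.

```python
# Hedged sketch of wrapping a scikit-learn learner with OxonFair.
# FairPredictor and group_metrics names follow the project README;
# the constructor signature and data layout below are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from oxonfair import FairPredictor
from oxonfair.utils import group_metrics as gm

# Synthetic stand-in for a tabular dataset with a binary protected attribute.
rng = np.random.default_rng(0)
df = pd.DataFrame({"f0": rng.normal(size=1000), "f1": rng.normal(size=1000)})
df["race"] = rng.integers(0, 2, size=1000)
df["label"] = (df["f0"] + 0.3 * df["race"] > 0).astype(int)
train, val = df.iloc[:700], df.iloc[700:]

clf = RandomForestClassifier(min_samples_leaf=20)
clf.fit(train[["f0", "f1"]], train["label"])

# Fairness is enforced on held-out validation data (claim (ii) above):
# per-group thresholds are chosen there, not on the training set.
fpred = FairPredictor(clf, val, "race")  # assumed signature
fpred.fit(gm.accuracy, gm.demographic_parity, 0.02)  # accuracy s.t. DP gap <= 0.02
```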
Dataset Splits: Yes
"We compare with all group fairness methods offered by AIF360, and the reductions approach of Fairlearn. OxonFair is compatible with any learner with an implementation of the method predict_proba consistent with scikit-learn [18], including AutoGluon [71] and XGBoost [19]. A comparison with Fairlearn and the group methods from AIF360 on the adult dataset can be seen in Figures 1 and 6 using random forests. This follows the setup of [9]: we enforce fairness with respect to race and binarize the attribute to white vs everyone else (this is required to compare with AIF360), 50% train data, 20% validation, and 30% test, and a minimum leaf size of 20." ... "Table 16: Multilingual Twitter corpus train/val/test statistics."
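The 50/20/30 split is reproducible with two chained splits: take 50% for training, then divide the remainder 40/60 to obtain 20% validation and 30% test of the total. A minimal sketch, assuming the OpenML copy of adult and numeric features only (the paper's exact preprocessing is not specified here):

```python
# Hedged sketch of the 50/20/30 train/validation/test split on adult.
# The OpenML copy and numeric-only features are simplifying assumptions.
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

adult = fetch_openml("adult", version=2, as_frame=True)
X = adult.data.select_dtypes("number")
y = (adult.target == ">50K").astype(int)  # assumed OpenML label encoding

# 50% train, then 40/60 over the remainder = 20%/30% of the total.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, train_size=0.4, random_state=0)

# Random forest with a minimum leaf size of 20, as in the paper's setup.
clf = RandomForestClassifier(min_samples_leaf=20).fit(X_tr, y_tr)
scores = clf.predict_proba(X_val)  # the scores thresholded when enforcing fairness
```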
Hardware Specification: Yes
"Computer vision experiments were conducted using an NVIDIA RTX 3500 Ada GPU with 12GB of RAM." ... "All experiments are conducted on an NVIDIA A100 80GB GPU."
Software Dependencies: No
The paper mentions software such as sklearn, AutoGluon, PyTorch, XGBoost, and the Adam optimizer as part of its experimental setup and toolkit compatibility, but it does not give version numbers for these dependencies, which a fully reproducible description would require.
Experiment Setup: Yes
"This follows the setup of [9]: we enforce fairness with respect to race and binarize the attribute to white vs everyone else (this is required to compare with AIF360), 50% train data, 20% validation, and 30% test, and a minimum leaf size of 20." ... "Implementation Details: We follow Wang et al.'s setup [41]. We use a ResNet-50 backbone [80] trained on ImageNet [81]. ... Dropout [82] (p = 0.5) is applied. All models are trained with a batch size of 32 using Adam [83] (learning rate 1e-4). We train for 20 epochs and select the model with the highest validation accuracy. Images are center-cropped and resized to 224 × 224. During training, we randomly crop and horizontally flip the images."
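These stated hyperparameters translate directly into a standard PyTorch training configuration. In the sketch below, the intermediate Resize(256) and the placement of dropout before the final linear layer are assumptions; the backbone, optimizer, learning rate, crop size, and augmentations follow the quoted description.

```python
# Hedged sketch of the stated CelebA training configuration.
# Resize(256) before cropping and the dropout placement are assumptions;
# ResNet-50 + ImageNet weights, Adam (lr 1e-4), 224x224 crops, and the
# train-time random crop + horizontal flip are as described above.
import torch.nn as nn
from torch.optim import Adam
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.Resize(256),              # assumed intermediate size
    transforms.RandomCrop(224),          # "we randomly crop ... the images"
    transforms.RandomHorizontalFlip(),   # "... and horizontally flip"
    transforms.ToTensor(),
])
eval_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # "center-cropped and resized to 224 x 224"
    transforms.ToTensor(),
])

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Sequential(                # binary attribute head (assumed placement)
    nn.Dropout(p=0.5),                   # "Dropout [82] (p = 0.5) is applied"
    nn.Linear(model.fc.in_features, 1),
)
optimizer = Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()
# Train for 20 epochs with batch size 32 and keep the checkpoint with the
# highest validation accuracy, as described in the quoted setup.
```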