Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Re-weighting Based Group Fairness Regularization via Classwise Robust Optimization
Authors: Sangwon Jung, Taeeon Park, Sanghyuk Chun, Taesup Moon
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that FairDRO is scalable and easily adaptable to diverse applications, and consistently achieves the state-of-the-art performance on several benchmark datasets in terms of the accuracy-fairness trade-off, compared to recent strong baselines. |
| Researcher Affiliation | Collaboration | 1 Department of Electrical and Computer Engineering, Seoul National University 2 NAVER AI Lab 3 ASRI/INMC/IPAI/AIIS, Seoul National University |
| Pseudocode | Yes | Algorithm 1: FairDRO Iterative Optimization |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code availability for the methodology described. |
| Open Datasets | Yes | Two tabular datasets, UCI Adult (Dua et al., 2017) (Adult) and ProPublica COMPAS (Julia Angwin & Kirchner, 2016) (COMPAS), are used for the benchmark. We also evaluate FairDRO on UTKFace (Zhang et al., 2017), a face dataset with multi-class and multi-group labels. CivilComments-WILDS (Koh et al., 2021). The description and results on another vision dataset, CelebA (Liu et al., 2015), are given in Appendix D.2. |
| Dataset Splits | No | The paper mentions evaluating models on "separate test sets" but does not explicitly provide details about training/validation/test dataset splits, percentages, or specific counts for a validation set. |
| Hardware Specification | Yes | Experiments are performed on a server with AMD Ryzen Threadripper PRO 3975WX CPUs and NVIDIA RTX A5000 GPUs. |
| Software Dependencies | No | The paper states, "We used PyTorch (Paszke et al., 2019)", but it does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | For tabular and vision datasets, we train all models with the AdamW optimizer (Loshchilov & Hutter, 2019) for 70 epochs. We set the mini-batch size and the weight decay as 128 and 0.001, respectively. The initial learning rate is set as 0.001 and decayed by cosine annealing technique (Loshchilov & Hutter, 2017). For the language dataset, we fine-tune pre-trained BERT with the AdamW optimizer for 3 epochs. We set the mini-batch size and the weight decay as 24 and 0.001, respectively. The initial learning rate is set as 0.00002 and adjusted with a learning rate schedule using a warm-up phase followed by a linear decay. |
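
The two learning-rate schedules quoted in the Experiment Setup row can be illustrated with a short, dependency-free Python sketch. It is not the authors' code. The function names, the `min_lr` floor of zero, the `warmup_steps` parameter, and the decision to index the schedule by a generic step counter are all assumptions made for illustration, since the quoted text does not state a minimum learning rate, a warm-up length, or whether the schedule advances per epoch or per mini-batch.

```python
import math

# Hyperparameters quoted in the Experiment Setup row.
TABULAR_VISION = {"epochs": 70, "batch_size": 128, "weight_decay": 0.001, "lr": 0.001}
LANGUAGE = {"epochs": 3, "batch_size": 24, "weight_decay": 0.001, "lr": 0.00002}


def cosine_annealing_lr(base_lr, step, total_steps, min_lr=0.0):
    """Cosine-annealed learning rate at `step` (0-based) out of `total_steps`.

    min_lr=0.0 is an assumption; the quoted text does not give a floor.
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def warmup_linear_decay_lr(base_lr, step, total_steps, warmup_steps):
    """Linear warm-up to base_lr, then linear decay to zero.

    warmup_steps is hypothetical; the quoted text does not state its value.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / max(total_steps - warmup_steps, 1)
```

For example, `cosine_annealing_lr(TABULAR_VISION["lr"], epoch, TABULAR_VISION["epochs"])` traces the tabular and vision schedule if it is stepped once per epoch, which is itself an assumption.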