On Measuring Fairness in Generative Models

Authors: Christopher Teo, Milad Abdollahzadeh, Ngai-Man (Man) Cheung

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | we conduct, for the first time, an in-depth study on fairness measurement, a critical component in gauging progress on fair generative models. We make three contributions. First, we conduct a study that reveals that the existing fairness measurement framework has considerable measurement errors, even when highly accurate sensitive attribute (SA) classifiers are used. These findings cast doubts on previously reported fairness improvements. Second, to address this issue, we propose CLassifier Error-Aware Measurement (CLEAM), a new framework which uses a statistical model to account for inaccuracies in SA classifiers. Our proposed CLEAM reduces measurement errors significantly, e.g., 4.98% → 0.62% for StyleGAN2 w.r.t. Gender. Additionally, CLEAM achieves this with minimal additional overhead. Third, we utilize CLEAM to measure fairness in important text-to-image generators and GANs, revealing considerable biases in these models that raise concerns about their applications. (The measurement error described here is illustrated in the first sketch after the table.)
Researcher Affiliation | Academia | Christopher T. H. Teo (christopher_teo@mymail.sutd.edu.sg), Milad Abdollahzadeh (milad_abdollahzadeh@sutd.sg), Ngai-Man Cheung (ngaiman_cheung@sutd.edu.sg), Singapore University of Technology and Design (SUTD)
Pseudocode | Yes | Algorithm 1: Computing point and interval estimates using CLEAM. Require: accuracy of SA classifier, α. 1. Compute SA classifier outputs p̂ : {p̂_1, ..., p̂_s} for s batches of generated data. 2. Compute sample mean μ_p̂ and sample variance σ²_p̂ using (6) and (7). 3. Use (8) to compute point estimate μ_CLEAM. 4. Use (10) to compute interval estimate ρ_CLEAM. (A hedged Python rendering is given in the second sketch after the table.)
Open Source Code | Yes | Code and more resources: https://sutd-visual-computing-group.github.io/CLEAM/
Open Datasets | Yes | More specifically, we utilize the official publicly released pre-trained StyleGAN2 [3] and StyleSwin [4] on CelebA-HQ [27] for sample generation. Then, we randomly sample from these GANs and utilize Amazon Mechanical Turk to hand-label the samples w.r.t. Gender and Black Hair, resulting in 9K samples for each GAN... Next, we follow a similar labeling process w.r.t. Gender, but with an SDM [5] pre-trained on LAION-5B [28]... As CLIP does not have a validation dataset, to measure α for CLIP, we utilize CelebA-HQ, a dataset with a similar domain to our application.
Dataset Splits | No | The paper mentions a 'validation stage' for SA classifiers and refers to 'validation results' in the supplementary material (Supp. D.7). However, it does not provide specific percentages or counts for training/validation/test splits for any dataset used in the main text.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU or CPU models, processor types, or memory amounts used for running its experiments; compute is only implied through its discussion of models and training.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., programming languages, libraries, or frameworks with their respective versions). It mentions using models like ResNet-18, MobileNetV2, and CLIP, but not the software environment or versions used to run them.
Experiment Setup | Yes | Here, we follow Choi et al. [1] as the Baseline for measuring fairness. In particular, to calculate each p̂ value for a generator, a corresponding batch of n = 400 samples is randomly drawn from GenData and passed into C_u for SA classification. We repeat this for s = 30 batches and report the mean results denoted by μ_Base and the 95% confidence interval denoted by ρ_Base. For a comprehensive analysis of the GANs, we repeat the experiment using four different SA classifiers: ResNet-18, ResNet-34 [34], MobileNetV2 [35], and VGG-16 [36]. (The third sketch after the table mirrors this protocol.)
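
To make the measurement-error claim in the Research Type row concrete: even a fairly accurate SA classifier skews the naive fairness estimate when its per-class accuracies are asymmetric. The following is a minimal numerical sketch; the accuracies and proportions are hypothetical, chosen only to illustrate the effect, not taken from the paper.

    # Hypothetical illustration: an SA classifier with asymmetric
    # per-class accuracies biases the naive estimate of the generator's
    # attribute distribution.
    alpha0, alpha1 = 0.98, 0.92   # assumed per-class classifier accuracies
    p0_true = 0.50                # assumed true proportion of class 0

    # Expected fraction of samples *labeled* class 0: class-0 samples
    # classified correctly, plus class-1 samples mislabeled as class 0.
    p0_measured = alpha0 * p0_true + (1.0 - alpha1) * (1.0 - p0_true)

    print(f"true p0:     {p0_true:.3f}")      # 0.500
    print(f"measured p0: {p0_measured:.3f}")  # 0.530, a 3-point error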
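
Algorithm 1 (Pseudocode row) can be rendered in Python roughly as follows. This is a minimal sketch, assuming the statistical model mixes the true proportion linearly through the classifier's per-class accuracies (alpha0, alpha1) and that the 95% interval comes from a normal approximation over the s batches; the paper's equations (6)-(10) define the exact forms, and the function name cleam_estimates is ours.

    import numpy as np

    def cleam_estimates(p_hat_batches, alpha0, alpha1, z=1.96):
        # p_hat_batches: per-batch classifier outputs p_hat_1..p_hat_s,
        # each the fraction of a generated batch labeled class 0.
        p_hat = np.asarray(p_hat_batches, dtype=float)
        s = len(p_hat)

        # Steps 1-2: sample mean and sample variance over the s batches
        # (stand-ins for the paper's equations (6) and (7)).
        mu_p_hat = p_hat.mean()
        var_p_hat = p_hat.var(ddof=1)

        # Step 3: point estimate (stand-in for equation (8)): invert the
        # assumed mixing  mu = alpha0*p0 + (1 - alpha1)*(1 - p0).
        denom = alpha0 + alpha1 - 1.0
        mu_cleam = (mu_p_hat - (1.0 - alpha1)) / denom

        # Step 4: interval estimate (stand-in for equation (10)): map a
        # normal-approximation 95% CI on mu_p_hat through the inversion.
        half_width = z * np.sqrt(var_p_hat / s)
        rho_cleam = ((mu_p_hat - half_width - (1.0 - alpha1)) / denom,
                     (mu_p_hat + half_width - (1.0 - alpha1)) / denom)
        return mu_cleam, rho_cleam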
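
The baseline protocol in the Experiment Setup row is likewise straightforward to express. The sketch below mirrors the stated setup (n = 400 samples per batch, s = 30 batches, mean plus 95% confidence interval); generator and sa_classifier are hypothetical stand-ins for the pre-trained GAN and the SA classifier.

    import numpy as np

    def baseline_fairness(generator, sa_classifier, n=400, s=30, z=1.96):
        # generator(n)         -> n generated samples (hypothetical callable)
        # sa_classifier(batch) -> 0/1 SA label per sample (hypothetical callable)
        p_hats = []
        for _ in range(s):
            batch = generator(n)                   # draw one batch of n samples
            labels = np.asarray(sa_classifier(batch))
            p_hats.append(np.mean(labels == 0))    # fraction labeled class 0
        p_hats = np.asarray(p_hats)

        mu_base = p_hats.mean()                           # point estimate
        half_width = z * p_hats.std(ddof=1) / np.sqrt(s)  # 95% CI half-width
        rho_base = (mu_base - half_width, mu_base + half_width)
        return mu_base, rho_base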