Evaluating Model Bias Requires Characterizing its Mistakes

Authors: Isabela Albuquerque, Jessica Schrouff, David Warde-Farley, Ali Taylan Cemgil, Sven Gowal, Olivia Wiles

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the utility of SKEWSIZE in multiple settings including: standard vision models trained on synthetic data, vision models trained on IMAGENET, and large scale vision-and-language models from the BLIP-2 family. In each case, the proposed SKEWSIZE is able to highlight biases not captured by other metrics, while also providing insights on the impact of recently proposed techniques, such as instruction tuning.
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Isabela Albuquerque <isabelaa@google.com>.
Pseudocode | Yes | In the Appendix we provide pseudocode for SKEWSIZE (Alg. 1), a Python implementation, and a discussion on modulating the impact low count predictions might have... Algorithm 1 Computing SKEWSIZE. (A hedged sketch of such a computation is given below the table.)
Open Source Code | No | The paper mentions a "Python implementation" and provides a code snippet in Appendix G.2 with an SPDX-License-Identifier. However, it does not explicitly state that the full source code for the method is released, nor does it link to a code repository.
Open Datasets | Yes | We use the DSPRITES dataset, which contains images of objects represented by different shapes, colors and at different positions... We chose 200 classes from the original label set (specifically, those present in TINYIMAGENET (Le & Yang, 2015)) and generated a synthetic dataset... We consider the DOMAINNET benchmark (Peng et al., 2019)... Apart from Visogender (Hall et al., 2023b), with only 500 instances, there are no real-world datasets available for evaluating gender biases on VLMs. Therefore, to investigate the utility of SKEWSIZE in the evaluation of VLMs, we gather synthetic data with templates constructed as follows.
Dataset Splits | No | The paper mentions evaluating models on "held out data" for DSPRITES and training on the "train split" for DOMAINNET, but it does not provide specific percentages, sample counts, or other details of the training, validation, and test splits that would allow the data partitioning to be reproduced exactly.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU specifications, or memory used for running the experiments. It only mentions training models like ResNet18 and ViT.
Software Dependencies | No | The paper mentions a Python implementation, various models (ResNet, ViT, Inception, BLIP-2), and Stable Diffusion, but it does not specify version numbers for any software libraries or dependencies, which are needed for reproducibility.
Experiment Setup | No | The paper states that ResNet18 models were trained "for 5k steps" and notes that "Architecture and training details are described in Appendix E." However, Appendix E does not provide specific hyperparameters such as learning rate, batch size, optimizer, or other training configuration details necessary for reproduction.
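
For orientation, below is a minimal sketch of how a SKEWSIZE-style computation could look in Python. It assumes the metric combines a per-class effect size (here Cramér's V, computed from a contingency table of predicted class versus bias-attribute value) with a skewness-based aggregation across classes; the function names, the use of scipy, and the handling of empty rows and columns are illustrative assumptions, not the paper's Algorithm 1 or the Appendix G.2 snippet.

```python
# Illustrative SkewSize-style sketch; not the paper's released implementation.
import numpy as np
from scipy.stats import chi2_contingency, skew


def cramers_v(table: np.ndarray) -> float:
    """Cramér's V effect size for a contingency table of counts."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


def skewsize_sketch(y_true, y_pred, bias_attr, classes, bias_values):
    """For each ground-truth class, tabulate predicted class against the bias
    attribute, compute an effect size, then aggregate the per-class effect
    sizes via their skewness."""
    y_true, y_pred, bias_attr = map(np.asarray, (y_true, y_pred, bias_attr))
    effect_sizes = []
    for c in classes:
        mask = y_true == c
        if not mask.any():
            continue
        # Rows: bias-attribute values; columns: predicted classes.
        table = np.array(
            [[np.sum(mask & (bias_attr == b) & (y_pred == p)) for p in classes]
             for b in bias_values],
            dtype=float,
        )
        # Drop empty rows/columns so the chi-squared statistic is well defined.
        table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
        if min(table.shape) > 1:
            effect_sizes.append(cramers_v(table))
    return float(skew(np.array(effect_sizes)))
```

A call such as `skewsize_sketch(y_true, y_pred, gender_attr, classes=range(200), bias_values=("male", "female"))` would illustrate the intended usage; the modulation of low-count predictions discussed in the paper's Appendix is deliberately omitted from this sketch.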