Understanding Bias in Large-Scale Visual Datasets

Authors: Boya Zeng, Yida Yin, Zhuang Liu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We apply our framework to three popular large-scale visual datasets: YFCC, CC, and DataComp, following [40]." (See the transformation sketch below the table.)
Researcher Affiliation | Collaboration | Boya Zeng (University of Pennsylvania), Yida Yin (UC Berkeley), Zhuang Liu (Meta FAIR)
Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "Our project page and code are available at boyazeng.github.io/understand_bias. We will release our code before the conference date."
Open Datasets | Yes | "Based on Liu and He [40], we take YFCC100M [66], CC12M [11], and DataComp-1B [19] (collectively referred to as YCD) and study their bias in this work."
Dataset Splits | Yes | "Specifically, we randomly sample 1M and 10K images from each dataset as training and validation sets, respectively." (See the sampling sketch below the table.)
Hardware Specification | Yes | "We use 8 NVIDIA 2080 Ti to train the ConvNeXt-T model for the dataset classification task with 8 gradient accumulation steps. The average compute time for each experiment is about 1.5 days." (See the gradient accumulation sketch below the table.)
Software Dependencies | Yes | "We employ the same ConvNeXt-Tiny image classification model [41]"; "we use LLaVA 1.5 [39, 38]"; "we finetune the Sentence-T5-base [50] model"; "reconstructing the images with a pre-trained VAE from Stable Diffusion [58]"; "We train an unconditional Diffusion Transformer (DiT) [52]". The paper also uses the SDXL-Turbo [59] diffusion model, Claude 3.5 Sonnet [2], Llama-3.1-8B-Instruct [17], MPNet-Base [63], and Sentence-BERT-Base [57].
Experiment Setup | Yes | "Table 1 details our default training recipe for dataset classification in Section 3." Config values: optimizer AdamW [42]; learning rate 1e-3; weight decay 0.3; optimizer momentum β1, β2 = 0.9, 0.95; batch size 4096; learning rate schedule cosine decay; warmup epochs 2; training epochs 30; augmentation RandomResizedCrop [65] & RandAugment (9, 0.5) [13]; label smoothing 0.1; mixup [80] 0.8; cutmix [77] 1.0. (See the recipe sketch below the table.)
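
On the Research Type row: the framework isolates one type of information at a time by transforming the images before re-running dataset classification. Below is a minimal, illustrative sketch of three such reductions (color removal, a mean-color summary, and a frequency low-pass), assuming only numpy and PIL; the function names are ours, and the paper's actual transformation set and implementations may differ.

```python
import numpy as np
from PIL import Image

def to_grayscale(img: Image.Image) -> Image.Image:
    """Drop color, keep luminance/structure."""
    return img.convert("L")

def mean_color(img: Image.Image) -> Image.Image:
    """Reduce the image to its average RGB value (color-only information)."""
    arr = np.asarray(img.convert("RGB"), dtype=np.float32)
    mean = arr.mean(axis=(0, 1)).astype(np.uint8)
    return Image.fromarray(np.tile(mean, (*arr.shape[:2], 1)))

def low_pass(img: Image.Image, keep_frac: float = 0.1) -> Image.Image:
    """Keep only low spatial frequencies via a centered FFT mask."""
    arr = np.asarray(img.convert("L"), dtype=np.float32)
    h, w = arr.shape
    f = np.fft.fftshift(np.fft.fft2(arr))
    mask = np.zeros((h, w))
    ch, cw = max(1, int(h * keep_frac / 2)), max(1, int(w * keep_frac / 2))
    mask[h // 2 - ch : h // 2 + ch, w // 2 - cw : w // 2 + cw] = 1.0
    out = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    return Image.fromarray(np.clip(out, 0, 255).astype(np.uint8))
```

The semantic and boundary transformations mentioned in the quote (e.g., captioning and object contours) require pretrained models such as the LLaVA and VAE/DiT components listed under Software Dependencies and are not sketched here.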
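On the Dataset Splits row: a minimal sketch of drawing disjoint 1M/10K train/validation samples from one dataset's file list. `image_paths` and the seed value are placeholders, not from the paper.

```python
import numpy as np

def split_dataset(image_paths, n_train=1_000_000, n_val=10_000, seed=0):
    """Draw disjoint random train/validation subsets from one dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(image_paths))
    train = [image_paths[i] for i in idx[:n_train]]
    val = [image_paths[i] for i in idx[n_train:n_train + n_val]]
    return train, val
```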
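On the Hardware Specification row: 8 gradient accumulation steps across 8 GPUs with an effective batch of 4096 implies 4096 / 8 GPUs / 8 steps = 64 samples per GPU per forward pass. A minimal single-process sketch of the accumulation pattern in PyTorch, with toy stand-ins for the model and data:

```python
import torch
import torch.nn as nn

# Toy stand-ins: the real setup trains ConvNeXt-T on the 1M-image splits.
model = nn.Linear(16, 3)  # placeholder model; 3 classes = YFCC / CC / DataComp
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
accum_steps = 8  # "8 gradient accumulation steps" from the paper

optimizer.zero_grad()
for step in range(64):  # placeholder for iterating a DataLoader
    x = torch.randn(64, 16)         # 64 samples/step -> 4096 effective batch on 8 GPUs
    y = torch.randint(0, 3, (64,))
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average over the full batch
    loss.backward()                 # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```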
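On the Experiment Setup row: a minimal sketch of the Table 1 recipe in PyTorch/torchvision. The paper likely uses a timm-style pipeline (as in the ConvNeXt codebase); here `steps_per_epoch` is derived from 1M images / 4096 batch size, the 224 input size is an assumption, and mixup/cutmix are noted but omitted.

```python
import math
import torch
import torchvision.transforms as T

def build_recipe(model, epochs=30, warmup_epochs=2, steps_per_epoch=245):
    """Optimizer and LR schedule per Table 1 (steps_per_epoch ~ 1M / 4096)."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-3, weight_decay=0.3, betas=(0.9, 0.95)
    )
    total = epochs * steps_per_epoch
    warmup = warmup_epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)                   # linear warmup
        progress = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Label smoothing 0.1 is applied through the loss.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# RandomResizedCrop + RandAugment at magnitude 9. torchvision's RandAugment
# has no magnitude-std parameter, so the "0.5" in the timm-style
# "RandAug (9, 0.5)" is not reproduced here. Mixup 0.8 / cutmix 1.0 are
# batch-level transforms (e.g., timm's Mixup) and are omitted.
train_transform = T.Compose([
    T.RandomResizedCrop(224),  # 224 input size assumed, not stated in Table 1
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
])

# Example usage with a placeholder model:
optimizer, scheduler = build_recipe(torch.nn.Linear(8, 3))
```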