Understanding Bias in Large-Scale Visual Datasets

Authors: Boya Zeng, Yida Yin, Zhuang Liu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We apply our framework to three popular large-scale visual datasets: YFCC, CC, and DataComp, following [40]." (See the transformation sketch below the table.)
Researcher Affiliation | Collaboration | Boya Zeng (University of Pennsylvania), Yida Yin (UC Berkeley), Zhuang Liu (Meta FAIR)
Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | "Our project page and code are available at boyazeng.github.io/understand_bias. We will release our code before the conference date."
Open Datasets | Yes | "Based on Liu and He [40], we take YFCC100M [66], CC12M [11], and DataComp-1B [19] (collectively referred to as YCD) and study their bias in this work."
Dataset Splits | Yes | "Specifically, we randomly sample 1M and 10K images from each dataset as training and validation sets, respectively." (See the sampling sketch below the table.)
Hardware Specification | Yes | "We use 8 NVIDIA 2080 Ti to train the ConvNeXt-T model for the dataset classification task with 8 gradient accumulation steps. The average compute time for each experiment is about 1.5 days." (See the gradient accumulation sketch below the table.)
Software Dependencies | Yes | "We employ the same ConvNeXt-Tiny image classification model [41]"; "we use LLaVA 1.5 [39, 38]"; "we finetune the Sentence-T5-base [50] model"; "reconstructing the images with a pre-trained VAE from Stable Diffusion [58]"; "We train an unconditional Diffusion Transformer (DiT) [52]". The paper also uses the SDXL-Turbo [59] diffusion model, Claude 3.5 Sonnet [2], Llama-3.1-8B-Instruct [17], MPNet-Base [63], and Sentence-BERT-Base [57].
Experiment Setup | Yes | "Table 1 details our default training recipe for dataset classification in Section 3." Config values: optimizer AdamW [42]; learning rate 1e-3; weight decay 0.3; optimizer momentum β1, β2 = 0.9, 0.95; batch size 4096; learning rate schedule cosine decay; warmup epochs 2; training epochs 30; augmentation RandomResizedCrop [65] & RandAugment (9, 0.5) [13]; label smoothing 0.1; mixup [80] 0.8; cutmix [77] 1.0. (See the recipe sketch below the table.)
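
On the Research Type row: the framework isolates one type of information at a time by transforming the images before re-running dataset classification. Below is a minimal, illustrative sketch of three such reductions (color removal, a mean-color summary, and a frequency low-pass), assuming only numpy and PIL; the function names are ours, and the paper's actual transformation set and implementations may differ.

```python
import numpy as np
from PIL import Image

def to_grayscale(img: Image.Image) -> Image.Image:
    """Drop color, keep luminance/structure."""
    return img.convert("L")

def mean_color(img: Image.Image) -> Image.Image:
    """Reduce the image to its average RGB value (color-only information)."""
    arr = np.asarray(img.convert("RGB"), dtype=np.float32)
    mean = arr.mean(axis=(0, 1)).astype(np.uint8)
    return Image.fromarray(np.tile(mean, (*arr.shape[:2], 1)))

def low_pass(img: Image.Image, keep_frac: float = 0.1) -> Image.Image:
    """Keep only low spatial frequencies via a centered FFT mask."""
    arr = np.asarray(img.convert("L"), dtype=np.float32)
    h, w = arr.shape
    f = np.fft.fftshift(np.fft.fft2(arr))
    mask = np.zeros((h, w))
    ch, cw = max(1, int(h * keep_frac / 2)), max(1, int(w * keep_frac / 2))
    mask[h // 2 - ch : h // 2 + ch, w // 2 - cw : w // 2 + cw] = 1.0
    out = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    return Image.fromarray(np.clip(out, 0, 255).astype(np.uint8))
```

The semantic and boundary transformations mentioned in the quote (e.g., captioning and object contours) require pretrained models such as the LLaVA and VAE/DiT components listed under Software Dependencies and are not sketched here.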
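On the Dataset Splits row: a minimal sketch of drawing disjoint 1M/10K train/validation samples from one dataset's file list. `image_paths` and the seed value are placeholders, not from the paper.

```python
import numpy as np

def split_dataset(image_paths, n_train=1_000_000, n_val=10_000, seed=0):
    """Draw disjoint random train/validation subsets from one dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(image_paths))
    train = [image_paths[i] for i in idx[:n_train]]
    val = [image_paths[i] for i in idx[n_train:n_train + n_val]]
    return train, val
```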
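On the Hardware Specification row: 8 gradient accumulation steps across 8 GPUs with an effective batch of 4096 implies 4096 / 8 GPUs / 8 steps = 64 samples per GPU per forward pass. A minimal single-process sketch of the accumulation pattern in PyTorch, with toy stand-ins for the model and data:

```python
import torch
import torch.nn as nn

# Toy stand-ins: the real setup trains ConvNeXt-T on the 1M-image splits.
model = nn.Linear(16, 3)  # placeholder model; 3 classes = YFCC / CC / DataComp
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
accum_steps = 8  # "8 gradient accumulation steps" from the paper

optimizer.zero_grad()
for step in range(64):  # placeholder for iterating a DataLoader
    x = torch.randn(64, 16)         # 64 samples/step -> 4096 effective batch on 8 GPUs
    y = torch.randint(0, 3, (64,))
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average over the full batch
    loss.backward()                 # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```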
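On the Experiment Setup row: a minimal sketch of the Table 1 recipe in PyTorch/torchvision. The paper likely uses a timm-style pipeline (as in the ConvNeXt codebase); here `steps_per_epoch` is derived from 1M images / 4096 batch size, the 224 input size is an assumption, and mixup/cutmix are noted but omitted.

```python
import math
import torch
import torchvision.transforms as T

def build_recipe(model, epochs=30, warmup_epochs=2, steps_per_epoch=245):
    """Optimizer and LR schedule per Table 1 (steps_per_epoch ~ 1M / 4096)."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-3, weight_decay=0.3, betas=(0.9, 0.95)
    )
    total = epochs * steps_per_epoch
    warmup = warmup_epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)                   # linear warmup
        progress = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Label smoothing 0.1 is applied through the loss.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

# RandomResizedCrop + RandAugment at magnitude 9. torchvision's RandAugment
# has no magnitude-std parameter, so the "0.5" in the timm-style
# "RandAug (9, 0.5)" is not reproduced here. Mixup 0.8 / cutmix 1.0 are
# batch-level transforms (e.g., timm's Mixup) and are omitted.
train_transform = T.Compose([
    T.RandomResizedCrop(224),  # 224 input size assumed, not stated in Table 1
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
])

# Example usage with a placeholder model:
optimizer, scheduler = build_recipe(torch.nn.Linear(8, 3))
```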