Ameliorate Spurious Correlations in Dataset Condensation

Authors: Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With a comprehensive empirical evaluation on canonical datasets with color, corruption, and background biases, we found that color and background biases in the original dataset are amplified through the condensation process, resulting in a notable decline in the performance of models trained on the condensed dataset, while corruption bias is suppressed through the condensation process. To reduce bias amplification in dataset condensation, we introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. (A sketch of such a reweighting scheme is given after this table.)
Researcher Affiliation | Academia | Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh; Department of Computer Science, University of California, Los Angeles. Correspondence to: Justin Cui <justincui@ucla.edu>, Cho-Jui Hsieh <chohsieh@cs.ucla.edu>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement of, or link to, open-source code for the methodology it describes. The GitHub link in the references (Awesome-Dataset-Distillation) is a general list of resources, not the authors' code for this specific work.
Open Datasets | Yes | In line with prior studies (Nam et al., 2020; Lee et al., 2021; Hwang et al., 2022), we explore three datasets: Colored MNIST (CMNIST), Background Fashion-MNIST (BG FMNIST), and Corrupted CIFAR-10. CMNIST originates from the MNIST dataset (Deng, 2012)... Corrupted CIFAR-10 applies different corruptions to the images in CIFAR-10... BFFHQ (Lee et al., 2021) is constructed from FFHQ (Karras et al., 2019)... BG FMNIST is constructed by using Fashion-MNIST (Xiao et al., 2017) as foregrounds and MiniPlaces (Zhou et al., 2017) as backgrounds.
Dataset Splits | No | The paper does not explicitly provide training/validation/test splits. It mentions using 'unbiased test datasets' for evaluation and 'bias-injected datasets' for condensation, but the partitioning of the original data into training, validation, and test sets is not detailed with percentages, counts, or references to predefined splits.
Hardware Specification | Yes | All experiments are run on a single 48GB NVIDIA RTX A6000 GPU.
Software Dependencies | No | The paper mentions software components such as ConvNet, SGD, and ResNet18, but does not provide specific version numbers for programming languages, libraries, or solvers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | Following previous dataset distillation methods (Zhao & Bilen, 2021a; Cazenavette et al., 2022), we use ConvNet as the model architecture. It has 128 filters with a kernel size of 3×3, followed by instance normalization, ReLU activation, and an average pooling layer. We use SGD as the optimizer with a 0.01 learning rate. For the supervised contrastive model, we use ResNet18 (He et al., 2016b) following (Hwang et al., 2022) with a projection head of 128 dimensions... For KDE, we fix kernel variance and temperature to be 0.1 across all datasets.
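The Experiment Setup row quotes a ConvNet with 128 filters, 3×3 kernels, instance normalization, ReLU, and average pooling, trained with SGD at a 0.01 learning rate. Below is a minimal PyTorch sketch of how such a network is commonly assembled in the dataset distillation literature; the block depth of 3, padding, pooling stride, and channel/class counts are assumptions not stated in the quoted setup.

```python
import torch
import torch.nn as nn

def conv_block(in_channels, out_channels=128):
    # One block as quoted above: 3x3 conv with 128 filters,
    # instance normalization, ReLU, and average pooling.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.InstanceNorm2d(out_channels, affine=True),
        nn.ReLU(inplace=True),
        nn.AvgPool2d(kernel_size=2, stride=2),
    )

class ConvNet(nn.Module):
    # A depth of 3 blocks is the common choice in this literature;
    # the quoted setup does not state the depth, so treat it as an assumption.
    def __init__(self, channels=3, num_classes=10, image_size=32, depth=3):
        super().__init__()
        blocks, in_ch = [], channels
        for _ in range(depth):
            blocks.append(conv_block(in_ch))
            in_ch = 128
        self.features = nn.Sequential(*blocks)
        feat_dim = 128 * (image_size // (2 ** depth)) ** 2
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = ConvNet()
# Optimizer as quoted: SGD with a 0.01 learning rate
# (momentum and weight decay are not specified in the quote).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```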
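The mitigation method summarized in the Research Type row is a sample reweighting scheme based on kernel density estimation, with kernel variance and temperature both fixed at 0.1 according to the Experiment Setup row. The sketch below shows one way such a scheme could be implemented; the feature source (e.g., the supervised contrastive encoder), the density-to-weight mapping, and the function name are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def kde_sample_weights(features, kernel_var=0.1, temperature=0.1):
    """Estimate per-sample density with a Gaussian KDE and turn it into
    sampling weights, so that rare (likely bias-conflicting) samples
    receive larger weights. Illustrative sketch only."""
    # Normalize features, e.g. embeddings from a contrastive encoder
    feats = F.normalize(features, dim=1)

    # Pairwise squared Euclidean distances between samples
    sq_dists = torch.cdist(feats, feats).pow(2)

    # Gaussian kernel density estimate for each sample
    density = torch.exp(-sq_dists / (2.0 * kernel_var)).mean(dim=1)

    # Low-density samples get higher weight; the temperature controls
    # how peaked the reweighting is
    weights = torch.softmax(-density / temperature, dim=0)
    return weights * len(weights)  # rescale so the mean weight is 1
```

In use, these per-sample weights would multiply the per-sample losses of the condensation objective, so that bias-conflicting examples contribute more to the synthesized data.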