Avoiding spurious correlations via logit correction

Authors: Sheng Liu, Xu Zhang, Nitesh Sekhar, Yue Wu, Prateek Singhal, Carlos Fernandez-Granda

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies suggest that machine learning models trained with empirical risk minimization (ERM) often rely on attributes that may be spuriously correlated with the class labels. Our extensive experimental results further reveal that the proposed LC loss outperforms state-of-the-art solutions on multiple popular benchmarks by a large margin, an average 5.5% absolute improvement, without access to spurious attribute labels.
Researcher Affiliation | Collaboration | Sheng Liu (1), Xu Zhang (2), Nitesh Sekhar (2), Yue Wu (2), Prateek Singhal (2), Carlos Fernandez-Granda (1); (1) New York University, USA, shengliu@nyu.edu; (2) Amazon Alexa AI, USA
Pseudocode | Yes | We provide the pseudo-code of the proposed logit correction and Group MixUp in Algorithm 1. (An illustrative sketch of these two components follows below the table.)
Open Source Code | Yes | Code is available at https://github.com/shengliu66/LC.
Open Datasets | Yes | In this section, we evaluate the effectiveness of the proposed logit correction (LC) method on five computer vision benchmarks presenting spurious correlations: Colored MNIST (C-MNIST) (Arjovsky et al., 2020), Corrupted CIFAR-10 (C-CIFAR-10) (Hendrycks & Dietterich, 2019; Nam et al., 2020), Biased FFHQ (bFFHQ) (Karras et al., 2019; Lee et al., 2021), Waterbird (Wah et al., 2011), and CelebA (Liu et al., 2015).
Dataset Splits | Yes | The ratios are set to 0.5%, 1%, 2%, and 5% for both C-MNIST and C-CIFAR-10. For the bFFHQ dataset, the model is trained with a 0.5% minority ratio and the accuracy is evaluated on the minority group (Lee et al., 2021). For the Waterbird and CelebA datasets, we measure the worst-group accuracy (Sohoni et al., 2020). (These splits are collected into a small configuration dictionary below the table.)
Hardware Specification | No | The paper mentions network architectures (MLP, ResNet-18, ResNet-50) but does not specify any hardware details such as GPU models, CPU types, or cloud computing resources used for the experiments.
Software Dependencies | No | The paper mentions using the 'Adam optimizer with β = (0.9, 0.999)' but does not list any specific software libraries (e.g., PyTorch, TensorFlow) or their version numbers, which are necessary for full reproducibility.
Experiment Setup | Yes | We use the Adam optimizer with β = (0.9, 0.999) and no weight decay, except for CelebA, where we set the weight decay to 1 × 10⁻⁴, and a batch size of 256. For Waterbird, we use the SGD optimizer with a weight decay of 1 × 10⁻⁴. Learning rates of 1 × 10⁻², 1 × 10⁻³, and 1 × 10⁻⁴ are used for Colored MNIST, Waterbird, and CelebA, respectively. We use a learning rate of 5 × 10⁻⁴ for the 0.5% ratio of Corrupted CIFAR-10 and 1 × 10⁻³ for the remaining ratios. We decay the learning rate by 0.5 at 10k iterations for both Colored MNIST and Corrupted CIFAR-10. For CelebA, we adopt a cosine annealing learning rate schedule. For Waterbird, we set q in GCE to 0.8, and to 0.7 for the other datasets. The ramp-up epoch T is set to 50 for Waterbird and CelebA and to 2 for the other datasets, and the moving-average momentum α is set to 0.5 for all datasets. (See the optimizer sketch below the table.)
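
The Pseudocode row above refers to Algorithm 1 in the paper, which combines the logit correction (LC) loss with Group MixUp. As a rough, non-authoritative illustration of the general idea only, a logit-adjustment-style correction adds estimated group log-priors to the logits before the softmax cross-entropy, and a within-class MixUp blends estimated majority- and minority-group samples. The sketch assumes PyTorch, and the names lc_loss, group_mixup, and log_group_prior are hypothetical, not taken from the released code.

import torch
import torch.nn.functional as F

def lc_loss(logits, targets, log_group_prior):
    # Logit-correction-style cross-entropy (illustrative sketch, not the
    # paper's exact Algorithm 1). log_group_prior holds per-class log-prior
    # adjustments for each example's estimated group; adding it to the logits
    # before the softmax enlarges the margin required on majority groups.
    # Shapes: logits (B, C), targets (B,), log_group_prior (B, C).
    return F.cross_entropy(logits + log_group_prior, targets)

def group_mixup(x_majority, x_minority, alpha=0.5):
    # Within-class MixUp between samples whose estimated group differs
    # (again only a sketch of the idea behind Group MixUp).
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * x_majority + (1.0 - lam) * x_minority

The q quoted for GCE in the Experiment Setup row is the hyperparameter of the generalized cross-entropy loss used by the paper's auxiliary group-estimation machinery, which is omitted from this sketch.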
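The Dataset Splits row lists the minority-group ratios and evaluation protocol per benchmark; for quick reference they can be collected in a small configuration dictionary. The layout and key names below are ours, not from the released code, and any detail not stated in the excerpt is left out.

# Minority-group ratios and evaluation protocol as stated above
# (dictionary layout is illustrative, not from the released code).
BENCHMARKS = {
    "C-MNIST":    {"minority_ratios": [0.005, 0.01, 0.02, 0.05]},
    "C-CIFAR-10": {"minority_ratios": [0.005, 0.01, 0.02, 0.05]},
    "bFFHQ":      {"minority_ratios": [0.005], "eval": "minority-group accuracy"},
    "Waterbird":  {"eval": "worst-group accuracy"},
    "CelebA":     {"eval": "worst-group accuracy"},
}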
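The Experiment Setup row pins down the optimizers, learning rates, and schedules, but, as the Software Dependencies row notes, not the framework. Assuming a PyTorch implementation (an assumption, not confirmed by the paper), the quoted settings could be wired up roughly as follows; model, dataset, and total_steps are placeholders.

import torch

def build_optimizer(model, dataset, total_steps):
    # Optimizer/scheduler settings quoted in the Experiment Setup row.
    # PyTorch is assumed; the paper does not name its framework.
    params = model.parameters()
    if dataset == "Waterbird":
        # SGD, weight decay 1e-4, learning rate 1e-3
        return torch.optim.SGD(params, lr=1e-3, weight_decay=1e-4), None
    if dataset == "CelebA":
        # Adam, weight decay 1e-4, learning rate 1e-4, cosine annealing
        # (the annealing horizon is not stated in the excerpt)
        opt = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4)
        return opt, torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    # Colored MNIST / Corrupted CIFAR-10: Adam, no weight decay; lr 1e-2 for
    # C-MNIST, 1e-3 (or 5e-4 at the 0.5% ratio) for C-CIFAR-10; halved at 10k steps.
    lr = 1e-2 if dataset == "C-MNIST" else 1e-3
    opt = torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999))
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[10_000], gamma=0.5)
    return opt, sched

Per the same row, a batch size of 256 is used throughout.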