An Investigation of Representation and Allocation Harms in Contrastive Learning

Authors: Subha Maity, Mayank Agarwal, Mikhail Yurochkin, Yuekai Sun

ICLR 2024

Reproducibility assessment: each item below gives the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental
LLM Response: "In this paper, we demonstrate that contrastive learning (CL), a popular variant of SSL, tends to collapse representations of minority groups with certain majority groups. We refer to this phenomenon as representation harm and demonstrate it on image and text datasets using the corresponding popular CL methods. Furthermore, our causal mediation analysis of allocation harm on a downstream classification task reveals that representation harm is partly responsible for it, thus emphasizing the importance of studying and mitigating representation harm. Finally, we provide a theoretical explanation for representation harm using a stochastic block model that leads to a representational neural collapse in a contrastive learning setting."
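To make the setting concrete: SimCLR-style methods train with the NT-Xent contrastive loss, and "collapse" here means that embeddings of an underrepresented class drift toward those of a majority class. The sketch below is ours, not the paper's code: nt_xent_loss is the standard SimCLR objective, and interclass_alignment is a simplified centroid-similarity probe for collapse (the paper's actual harm measure may differ).

import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # Standard NT-Xent (SimCLR) loss over a batch of paired augmented views.
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2n, d), unit-norm rows
    sim = (z @ z.t()) / temperature                     # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # a view is never its own positive
    # The positive for row i is its other augmented view: i <-> i + n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def interclass_alignment(emb, labels, a, b):
    # Cosine similarity between embedding centroids of classes a and b;
    # values near 1 suggest the two classes have collapsed together.
    mu_a = F.normalize(emb[labels == a].mean(dim=0), dim=0)
    mu_b = F.normalize(emb[labels == b].mean(dim=0), dim=0)
    return float(mu_a @ mu_b)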
Researcher Affiliation: Collaboration
LLM Response: Subha Maity, Department of Statistics, University of Michigan, Ann Arbor, MI (smaity@umich.edu); Mayank Agarwal, IBM Research, MIT-IBM Watson Lab, Cambridge, MA (mayank.agarwal@ibm.com); Mikhail Yurochkin, IBM Research, MIT-IBM Watson Lab, Cambridge, MA (mikhail.yurochkin@ibm.com); Yuekai Sun, Department of Statistics, University of Michigan, Ann Arbor, MI (yuekai@umich.edu)
Pseudocode: No
LLM Response: The paper describes computational models and mathematical formulations, but no structured pseudocode or algorithm blocks are provided.
Open Source Code: Yes
LLM Response: "Accompanying codes can be found in https://github.com/smaityumich/CLrepresentation-harm."
Open Datasets: Yes
LLM Response: "For our controlled study, we consider CIFAR10 (Krizhevsky et al., 2009)" and "We consider BIASBIOS dataset (De-Arteaga et al., 2019) which consists of around 400k online biographies in English extracted from the Common Crawl data."
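Both datasets are publicly available. Purely as an illustration (not the authors' loading code), CIFAR10 can be fetched through torchvision; the root path below is hypothetical.

from torchvision import datasets, transforms

cifar_train = datasets.CIFAR10(
    root="./data",                    # hypothetical local path
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)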
Dataset Splits: Yes
LLM Response: "We randomly divide the 400k BIASBIOS dataset into the following three splits: 65% as training set, 10% as validation set, and 25% as test set." and "To simulate underrepresentation, we randomly subsample 1% of the images for one of the classes when training our CL models."
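The 65/10/25 proportions and the 1% subsampling rate are quoted from the paper; the implementation below is our own sketch, with illustrative function names and seeds.

import numpy as np

def three_way_split(n, seed=0):
    # 65% train / 10% validation / 25% test, as described for BIASBIOS.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_val = int(0.65 * n), int(0.10 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def subsample_class(labels, cls, keep_frac=0.01, seed=0):
    # Keep only 1% of one class to simulate underrepresentation (CIFAR10 setup).
    rng = np.random.default_rng(seed)
    cls_idx = np.where(labels == cls)[0]
    kept = rng.choice(cls_idx, size=max(1, int(keep_frac * len(cls_idx))), replace=False)
    rest = np.where(labels != cls)[0]
    return np.sort(np.concatenate([rest, kept]))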
Hardware Specification: No
LLM Response: The paper mentions using a 'ResNet-34 backbone' and 'BERT' as model architectures, but does not provide any specific hardware details such as GPU or CPU models or memory amounts used for the experiments.
Software Dependencies: No
LLM Response: The paper mentions using 'SimCLR', 'SimSiam', and 'SimCSE' implementations and refers to external GitHub repositories for them, but it does not specify exact version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup: Yes
LLM Response: "Table 1: Training parameters for SimCSE."

model parameter    value
batch size         64
sequence length    512
learning rate      1e-5
training epochs    1

Further experimental details are deferred to the supplementary code: "Please see simclr.py in supplementary codes for parameter values." and "Please see our jobs.py and main.py for the specification of hyperparameters, which are kept the same in both training cases with balanced and imbalanced datasets."
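SimCSE implementations commonly build on Hugging Face transformers, so the Table 1 values can be expressed as a TrainingArguments sketch; this is our rendering, not the authors' configuration, and the output directory is hypothetical. The 512-token sequence length is enforced at tokenization time rather than in TrainingArguments.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./simcse-run",        # hypothetical output path
    per_device_train_batch_size=64,   # batch size 64
    learning_rate=1e-5,               # learning rate 1e-5
    num_train_epochs=1,               # training epochs 1
)
# Sequence length 512 would be applied when tokenizing,
# e.g. tokenizer(texts, max_length=512, truncation=True).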