A Message Passing Perspective on Learning Dynamics of Contrastive Learning

Authors: Yifei Wang, Qi Zhang, Tianqi Du, Jiansheng Yang, Zhouchen Lin, Yisen Wang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that both techniques lead to clear benefits on benchmark datasets. In turn, their empirical successes also help verify the validity of our established connection between CL and MP-GNNs.
Researcher Affiliation | Academia | 1) School of Mathematical Sciences, Peking University; 2) National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University; 3) Institute for Artificial Intelligence, Peking University; 4) Peng Cheng Laboratory
Pseudocode | No | In this work, we transfer this idea to the contrastive learning scenario, that is, to alleviate feature collapse by combining features from multiple previous epochs. Specifically, we create a memory bank $\mathcal{M}$ to store the latest $s$ epochs of features of the same natural sample $\bar{x}$. At the $t$-th epoch, for $x$ generated from $\bar{x}$, we have stored its old features $z^{(t-1)}_{\bar{x}}, z^{(t-2)}_{\bar{x}}, \ldots, z^{(t-r)}_{\bar{x}}$, where $r = \min(s, t)$. Then we replace the original positive feature $f_\theta(x^+)$ with the aggregation of the old features in the memory bank, where we simply adopt the sum aggregation for simplicity, i.e., $\bar{z}_{\bar{x}} = \frac{1}{r}\sum_{i=1}^{r} z^{(t-i)}_{\bar{x}}$. Next, we optimize the encoder network $f_\theta$ with the following multi-stage alignment loss alone: $\mathcal{L}^{\rm multi}_{\rm align}(\theta) = -\mathbb{E}_{\bar{x}}\,\mathbb{E}_{x|\bar{x}}\big(f_\theta(x)^\top \bar{z}_{\bar{x}}\big)$ (Eq. 13). Afterwards, we push the newest feature $z = f_\theta(x)$ to the memory bank and drop the oldest one if the size limit is exceeded. In this way, we can align the new feature with multiple old features that are less collapsed, which, in turn, also prevents the collapse of the updated features. (A code sketch of this procedure follows the table.)
Open Source Code | Yes | The code is available at https://github.com/PKU-ML/Message-Passing-Contrastive-Learning.
Open Datasets | Yes | Table 1: Comparison of linear probing accuracy (%) of contrastive learning methods and their variants inspired by MP-GNNs, evaluated on three datasets, CIFAR-10 (C10), CIFAR-100 (C100), and ImageNet-100 (IN100), and two backbone networks: ResNet-18 (RN18) and ResNet-50 (RN50).
Dataset Splits | No | We pretrain the encoder for 100 epochs on ImageNet-100 and for 200 epochs on CIFAR-10 and CIFAR-100. After the pretraining process, we train a linear classifier on top of the frozen backbones using the SGD optimizer.
Hardware Specification | No | No specific hardware information found in the paper.
Software Dependencies | No | Specifically, we use the SGD optimizer with a batch size of 256 and a weight decay of 5e-5. We use the LARS optimizer with a cosine-annealed learning rate schedule and a batch size of 512.
Experiment Setup | Yes | We train ResNet-18 and ResNet-50 backbones for 200 epochs with the default hyperparameters of SimSiam (Chen & He, 2021). Specifically, we use the SGD optimizer with a batch size of 256 and a weight decay of 5e-5. We pretrain the encoder for 100 epochs on ImageNet-100 and for 200 epochs on CIFAR-10 and CIFAR-100. We use the LARS optimizer with a cosine-annealed learning rate schedule and a batch size of 512. After the pretraining process, we train a linear classifier on top of the frozen backbones using the SGD optimizer. To obtain a more accurate estimation of semantic similarity, we use the features before the projection layer and warm up the model for 30 epochs. (A sketch of these optimizer settings follows the table.)
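
The Pseudocode row above describes the multi-stage aggregation only in prose. Below is a minimal PyTorch-style sketch of that procedure under assumed naming: `MultiStageMemoryBank`, `multi_stage_alignment_loss`, the feature normalization, and the per-sample id bookkeeping are my own choices, not the authors' implementation (which is available at the repository linked in the table). The loss term follows Eq. (13) as reconstructed above.

```python
# Minimal sketch of the multi-stage aggregation described in the "Pseudocode" row.
# All names and the normalization step are assumptions, not the authors' code.
import torch
import torch.nn.functional as F
from collections import defaultdict, deque


class MultiStageMemoryBank:
    """Keeps, for each natural sample id, its features from the last `num_stages` epochs."""

    def __init__(self, num_stages: int):
        # deque(maxlen=...) automatically drops the oldest feature once the bank is full
        self.bank = defaultdict(lambda: deque(maxlen=num_stages))

    def aggregate(self, sample_ids):
        """Average the stored old features z^{(t-1)}, ..., z^{(t-r)} of each sample.

        Assumes every id already has at least one stored feature (i.e. r >= 1);
        sample_ids is a list of Python ints indexing the natural samples.
        """
        return torch.stack(
            [torch.stack(list(self.bank[i])).mean(dim=0) for i in sample_ids]
        )

    def push(self, sample_ids, features):
        """Store the newest features so they can serve as old features in later epochs."""
        for i, z in zip(sample_ids, features.detach()):
            self.bank[i].append(z)


def multi_stage_alignment_loss(encoder, views, sample_ids, memory_bank):
    """Align f_theta(x) with the aggregated old features, as in Eq. (13) above."""
    z_new = F.normalize(encoder(views), dim=-1)     # f_theta(x)
    z_old = memory_bank.aggregate(sample_ids)       # aggregated \bar z_{\bar x}
    loss = -(z_new * z_old).sum(dim=-1).mean()      # negative inner-product alignment
    memory_bank.push(sample_ids, z_new)             # refresh the bank with the newest feature
    return loss
```

In a pretraining loop this loss would be computed per batch, so that at epoch $t$ the bank holds at most $r = \min(s, t)$ old features per natural sample, matching the quoted description.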
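
The optimizer details quoted in the Software Dependencies and Experiment Setup rows can be summarized in a short sketch. Only the weight decay (5e-5), the cosine-annealed schedule, the pretraining epoch counts, and the use of SGD for linear probing come from the quotes; the base learning rates, momentum, and linear-probe epoch count below are placeholders for illustration, and the LARS optimizer (not part of torch.optim) is omitted.

```python
# Sketch of the quoted SGD settings plus a standard linear-probing loop.
# Learning rates, momentum, and the probe epoch count are assumed, not from the paper.
import torch
from torch import nn


def build_pretraining_optimizer(encoder, epochs=200, base_lr=0.05, momentum=0.9):
    """SGD with 5e-5 weight decay and a cosine-annealed learning rate (base_lr/momentum assumed)."""
    optimizer = torch.optim.SGD(encoder.parameters(), lr=base_lr,
                                momentum=momentum, weight_decay=5e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler


def linear_probe(frozen_backbone, feature_dim, num_classes, train_loader,
                 epochs=100, lr=30.0):
    """Linear evaluation: train a classifier with SGD on top of the frozen backbone."""
    frozen_backbone.eval()
    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():                 # backbone parameters stay frozen
                features = frozen_backbone(images)
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```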