On the Surrogate Gap between Contrastive and Supervised Losses

Authors: Han Bao, Yoshihiro Nagano, Kento Nozawa

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify that our theory is consistent with experiments on synthetic, vision, and language datasets.
Researcher Affiliation | Academia | The University of Tokyo, Tokyo, Japan; RIKEN AIP, Tokyo, Japan. Correspondence to: Han Bao (currently with Kyoto University) <bao@i.kyoto-u.ac.jp>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The experimental code to reproduce all figures in the paper is available at https://github.com/nzw0301/gap-contrastive-and-supervised-losses.
Open Datasets | Yes | We used the same datasets as Arora et al. (2019): the CIFAR-100 (Krizhevsky, 2009) and Wiki-3029 (Arora et al., 2019) datasets, along with the CIFAR-10 (Krizhevsky, 2009) dataset.
Dataset Splits | Yes | We treated 10% of the training samples as a validation dataset by sampling uniformly per class. We used the original test dataset for testing. ... we split the dataset into 70%/10%/20% train/validation/test datasets, respectively. (A split sketch in Python is given after the table.)
Hardware Specification | Yes | We implemented our experimental code using PyTorch's (Paszke et al., 2019) distributed data-parallel training (Li et al., 2020) on 8 NVIDIA A100 GPUs provided by the internal cluster.
Software Dependencies | Yes | We used the Adam (Kingma & Ba, 2015) optimizer ... provided by PyTorch (Paszke et al., 2019). ... We also used scikit-learn (Pedregosa et al., 2011) ... matplotlib (Hunter, 2007) and seaborn (Waskom, 2021) via pandas (Reback et al., 2020) ... hydra (Yadan, 2019) and experimental results using Weights & Biases (Biewald, 2020). For effective parallelized execution of our experimental code, we use GNU Parallel (Tange, 2021).
Experiment Setup | Yes | We used the Adam (Kingma & Ba, 2015) optimizer with a weight decay coefficient of 0.01 for all parameters. The mini-batch size was set to B = 1,024 and the number of epochs was 300. The learning rate was set to 0.01 with the ReduceLROnPlateau scheduler (patience: 10 epochs). ... We used the LARC (You et al., 2017) optimizer wrapping momentum SGD, whose momentum term was 0.9. We applied weight decay with coefficient 10^-4 to all parameters except for all bias terms and batch norm parameters. The base learning rate was initialized at lr · B, where lr ∈ {2, 4, 6} × 1/64 and the mini-batch size was B = 1,024. ... The number of epochs was 2,000. (An optimizer-configuration sketch in Python is given after the table.)
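The dataset splits quoted above (a class-stratified 10% validation hold-out from the official training set, and a 70%/10%/20% train/validation/test split) could be reproduced along the following lines. This is a minimal sketch rather than the authors' released code; the function names and the `train_X`, `train_y`, `X`, `y` arrays are hypothetical placeholders, and it assumes scikit-learn's `train_test_split`, which the paper lists among its dependencies.

```python
# Minimal sketch of the quoted splits; not the authors' released code.
# `train_X`, `train_y`, `X`, `y` are hypothetical in-memory arrays of samples and labels.
from sklearn.model_selection import train_test_split


def cifar_style_split(train_X, train_y, seed=0):
    """Hold out 10% of the training set as validation, stratified by class;
    the original test set is kept untouched for testing."""
    tr_X, val_X, tr_y, val_y = train_test_split(
        train_X, train_y, test_size=0.10, stratify=train_y, random_state=seed
    )
    return (tr_X, tr_y), (val_X, val_y)


def wiki3029_style_split(X, y, seed=0):
    """70%/10%/20% train/validation/test split, stratified by class."""
    tr_X, rest_X, tr_y, rest_y = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    # Split the remaining 30% into 10% validation and 20% test (ratio 1:2).
    val_X, te_X, val_y, te_y = train_test_split(
        rest_X, rest_y, test_size=2 / 3, stratify=rest_y, random_state=seed
    )
    return (tr_X, tr_y), (val_X, val_y), (te_X, te_y)
```

The `stratify` argument enforces the per-class uniform sampling described in the Dataset Splits row.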
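The two optimizer settings quoted in the Experiment Setup row can be expressed roughly as follows in PyTorch. This is a hedged sketch under the stated hyperparameters (Adam with weight decay 0.01 and ReduceLROnPlateau with patience 10; momentum SGD with momentum 0.9 and weight decay 10^-4 excluded from biases and batch-norm parameters); `model` and `base_lr` are hypothetical inputs, and the LARC wrapper is left as a comment because its import path is not quoted above and would be an assumption here.

```python
# Minimal sketch of the two optimizer settings quoted above; not the authors' code.
# `model` is a hypothetical torch.nn.Module; `base_lr` is the pre-scaled learning rate.
import torch


def adam_setup(model):
    # Adam with weight decay 0.01 on all parameters; ReduceLROnPlateau with patience 10.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
    return optimizer, scheduler


def momentum_sgd_setup(model, base_lr):
    # Weight decay 1e-4 on all parameters except biases and batch-norm parameters.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if param.ndim == 1 or name.endswith(".bias"):
            no_decay.append(param)  # biases and norm scales/offsets: no weight decay
        else:
            decay.append(param)
    optimizer = torch.optim.SGD(
        [
            {"params": decay, "weight_decay": 1e-4},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=base_lr,
        momentum=0.9,
    )
    # The paper wraps this optimizer with LARC (You et al., 2017), e.g. via an
    # external implementation: optimizer = LARC(optimizer)  # assumed wrapper
    return optimizer
```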