On the Surrogate Gap between Contrastive and Supervised Losses
Authors: Han Bao, Yoshihiro Nagano, Kento Nozawa
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify that our theory is consistent with experiments on synthetic, vision, and language datasets. |
| Researcher Affiliation | Academia | ¹The University of Tokyo, Tokyo, Japan; ²RIKEN AIP, Tokyo, Japan. Correspondence to: Han Bao (currently with Kyoto University) <bao@i.kyoto-u.ac.jp>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The experimental codes to reproduce all figures in the paper are available at https://github.com/nzw0301/gap-contrastive-and-supervised-losses. |
| Open Datasets | Yes | We used the same datasets as Arora et al. (2019): CIFAR-100 (Krizhevsky, 2009) and Wiki-3029 (Arora et al., 2019) datasets, along with CIFAR-10 (Krizhevsky, 2009) dataset. |
| Dataset Splits | Yes | We treated 10% of the training samples as a validation dataset by sampling classes uniformly. We used the original test dataset for testing. ... we split the dataset into 70%/10%/20% train/validation/test datasets, respectively. |
| Hardware Specification | Yes | We implemented our experimental code by using PyTorch's (Paszke et al., 2019) distributed data-parallel training (Li et al., 2020) on 8 NVIDIA A100 GPUs provided by the internal cluster. |
| Software Dependencies | Yes | We used Adam (Kingma & Ba, 2015) optimizer... provided by PyTorch (Paszke et al., 2019). ... We also used scikit-learn (Pedregosa et al., 2011) ... matplotlib (Hunter, 2007) and seaborn (Waskom, 2021) via pandas (Reback et al., 2020) ... hydra (Yadan, 2019) and experimental results using Weights & Biases (Biewald, 2020). For effective parallelized execution of our experimental codes, we use GNU Parallel (Tange, 2021). |
| Experiment Setup | Yes | We used Adam (Kingma & Ba, 2015) optimizer with the weight decay of coefficient 0.01 to all parameters. The mini-batch size was set to B = 1,024 and the number of epochs was 300. The learning rate was set to 0.01 with ReduceLROnPlateau scheduler (patience: 10 epochs)... We used LARC (You et al., 2017) optimizer wrapping the momentum SGD, whose momentum term was 0.9. We applied weight decay of coefficient 10⁻⁴ to all parameters except for all bias terms and batch norm's parameters. The base learning rate was initialized at lr · B, where lr ∈ {2, 4, 6} × 1/64 and mini-batch size B = 1,024... The number of epochs was 2,000. |
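
The "Dataset Splits" row describes two schemes: a class-uniform 10% validation hold-out (with the original test set reused) and a 70%/10%/20% split. A minimal sketch of both, assuming scikit-learn-style stratified sampling as one reading of "sampling classes uniformly"; the arrays `X` and `y` are illustrative placeholders, not the authors' data loaders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(7)
X = rng.randn(1000, 32)            # placeholder features
y = rng.randint(0, 10, size=1000)  # placeholder class labels

# Vision setting: hold out 10% of the training samples as validation,
# sampled uniformly per class (stratified); the original test set is kept.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=7
)

# Wiki-3029 setting: 70%/10%/20% train/validation/test split.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)
X_train2, X_val2, y_train2, y_val2 = train_test_split(
    X_trainval, y_trainval, test_size=0.1 / 0.8, stratify=y_trainval, random_state=7
)
```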
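The "Experiment Setup" row lists two optimizer configurations. The sketch below shows how they might be wired up in PyTorch; the toy model, the base learning rate value, and the Apex import for LARC are assumptions for illustration, not the authors' exact code (their released implementation is linked in the "Open Source Code" row).

```python
import torch
from torch import nn

# Toy model standing in for the encoder; chosen only so the sketch runs.
model = nn.Sequential(
    nn.Linear(32, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Linear(128, 10)
)

# (1) Adam with weight decay 0.01 on all parameters,
#     plus ReduceLROnPlateau with a patience of 10 epochs.
adam = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(adam, patience=10)

# (2) Momentum SGD (momentum 0.9) with weight decay 1e-4 on all parameters
#     except biases and batch-norm parameters, to be wrapped by LARC.
decay, no_decay = [], []
for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)):
        no_decay += list(module.parameters(recurse=False))
    else:
        for name, param in module.named_parameters(recurse=False):
            (no_decay if name == "bias" else decay).append(param)

sgd = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.5,       # placeholder; the paper scales the base LR with batch size B
    momentum=0.9,
)

# LARC is commonly taken from NVIDIA Apex; this wrapping step is an assumption.
# from apex.parallel.LARC import LARC
# optimizer = LARC(sgd)
```

After each validation epoch in setup (1), `scheduler.step(val_loss)` would reduce the learning rate once the monitored loss plateaus for 10 epochs, matching the stated patience.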