Memorization in Self-Supervised Learning Improves Downstream Generalization
Authors: Wenhao Wang, Muhammad Ahmad Kaleem, Adam Dziedzic, Michael Backes, Nicolas Papernot, Franziska Boenisch
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive empirical analysis on diverse encoder architectures and datasets, we highlight that even though SSL relies on large datasets and strong augmentations (both known in supervised learning as regularization techniques that reduce overfitting), significant fractions of training data points still experience high memorization. |
| Researcher Affiliation | Collaboration | Wenhao Wang¹, Muhammad Ahmad Kaleem², Adam Dziedzic¹, Michael Backes¹, Nicolas Papernot², Franziska Boenisch¹; ¹CISPA, ²University of Toronto and Vector Institute. Correspondence to: adam.dziedzic@sprintml.com and boenisch@cispa.de. Part of the work was done while the authors were at the University of Toronto and the Vector Institute. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the methodology or links to a code repository. |
| Open Datasets | Yes | We train these encoders for 300 epochs with ImageNet ILSVRC-2012 (Russakovsky et al., 2015) and for 600 epochs with CIFAR10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), and STL10 (Coates et al., 2011). |
| Dataset Splits | Yes | For example, in CIFAR10, we use 80% of the training data, i.e., 40000 samples, as shared training data S_S between encoders f and g. The next 10% of samples, i.e., 5000, are used as candidates S_C to evaluate memorization; we add those to the training data of f only. The remaining 10%, another 5000 samples, are used as an independent set S_I, on which we do not train f but only g. (A minimal split sketch follows this table.) |
| Hardware Specification | Yes | The ImageNet and STL10 based encoders are trained on a server with two NVIDIA A100 GPUs. CIFAR10, CIFAR100, and SVHN-based encoders and all linear probing evaluations are performed on a 4090 GPU server with an Intel 13700K processor and 64 GB of RAM. |
| Software Dependencies | No | The paper mentions software components like optimizers (AdamW, LARS, SGD) and learning rate schedules (Cos. Decay), but does not specify version numbers for these or other relevant software libraries/frameworks. |
| Experiment Setup | Yes | We set the batch size to 1024 for all our experiments and train for 600 epochs on CIFAR10, SVHN, and STL10, and for 300 epochs on ImageNet. ... Table 6 (Experimental Setup) provides details on encoder training and evaluation, per column: Training Epochs (ImageNet / others): 300 / 600; Warm-up Epochs (ImageNet / others): 30 / 60; Batch Size: 2048, 4096, 1024, 256; Optimizer: AdamW, LARS, AdamW, SGD; Learning Rate: 1.2e-3, 4.8, 2e-3, 3e-3, 1.6; Learning Rate Schedule: cosine decay. (A hedged configuration sketch follows this table.) |
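
A minimal sketch of the CIFAR10 split described in the Dataset Splits row, assuming a random 80/10/10 partition built with torchvision. The set names `shared_set`, `candidate_set`, and `independent_set`, the fixed seed, and the use of a random permutation are illustrative assumptions; the paper does not release code, so this is not the authors' implementation.

```python
# Sketch of the 80% / 10% / 10% CIFAR10 split (S_S, S_C, S_I) from the paper's
# description. Names and the random-permutation strategy are assumptions.
import torch
from torch.utils.data import Subset
from torchvision.datasets import CIFAR10

train_set = CIFAR10(root="./data", train=True, download=True)  # 50,000 samples

# Deterministic shuffle so the split is reproducible (seed is an assumption).
generator = torch.Generator().manual_seed(0)
permutation = torch.randperm(len(train_set), generator=generator).tolist()

shared_set = Subset(train_set, permutation[:40_000])            # S_S: trains both f and g
candidate_set = Subset(train_set, permutation[40_000:45_000])   # S_C: added to f's training data only
independent_set = Subset(train_set, permutation[45_000:])       # S_I: added to g's training data only

# Encoder f is trained on S_S ∪ S_C, encoder g on S_S ∪ S_I; memorization of
# each candidate in S_C is then evaluated by comparing f and g on it.
```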
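
The flattened Table 6 excerpt in the Experiment Setup row lists per-column values (batch sizes 2048/4096/1024/256, optimizers AdamW/LARS/AdamW/SGD, five learning rates) whose column-to-method mapping cannot be recovered from the text alone. The sketch below therefore encodes only the settings stated unambiguously (epoch and warm-up budgets, batch size 1024, cosine decay); the `EncoderTrainingConfig` name, the AdamW default, and the 2e-3 learning rate are placeholders chosen from the listed values, not confirmed pairings.

```python
# Hedged training-configuration sketch for the Experiment Setup row.
# Only epochs, warm-up, batch size 1024, and cosine decay are taken directly
# from the text; optimizer and learning-rate defaults are assumptions.
from dataclasses import dataclass

@dataclass
class EncoderTrainingConfig:
    dataset: str
    epochs: int                   # 300 for ImageNet, 600 otherwise (stated in the paper)
    warmup_epochs: int            # 30 for ImageNet, 60 otherwise (stated in the paper)
    batch_size: int = 1024        # "batch size to 1024 for all our experiments"
    optimizer: str = "AdamW"      # assumption: one of the Table 6 optimizers
    learning_rate: float = 2e-3   # assumption: one of the Table 6 learning rates
    lr_schedule: str = "cosine_decay"

def make_config(dataset: str) -> EncoderTrainingConfig:
    """Return the epoch and warm-up budget stated for the given dataset."""
    if dataset.lower() == "imagenet":
        return EncoderTrainingConfig(dataset, epochs=300, warmup_epochs=30)
    return EncoderTrainingConfig(dataset, epochs=600, warmup_epochs=60)

configs = [make_config(d) for d in ["ImageNet", "CIFAR10", "SVHN", "STL10"]]
```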