Localizing Memorization in SSL Vision Encoders

Authors: Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By performing a systematic study on localizing memorization with our two metrics on various encoder architectures (convolutional and transformer-based) trained on diverse vision datasets with contrastive and non-contrastive SSL frameworks, we make the following key discoveries:
Researcher Affiliation | Academia | Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch (CISPA Helmholtz Center for Information Security)
Pseudocode | No | The paper includes mathematical formulas and algorithmic descriptions in prose, but it does not present any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code is attached as supplementary material.
Open Datasets | Yes | We base our experiments on ImageNet ILSVRC-2012 [42], CIFAR10 [32], CIFAR100 [32], SVHN [40], and STL10 [18].
Dataset Splits | No | The paper mentions training epochs and datasets, and refers to 'early stopping', which implies the use of a validation set. However, it does not provide specific details on the validation split (e.g., percentage or sample counts).
Hardware Specification | Yes | We finish all our experiments on two devices: a cloud server with four A100 GPUs and a local workstation with an Intel 13700K CPU, an Nvidia 4090 graphics card, and 64GB of RAM.
Software Dependencies | No | The paper mentions the use of SSL frameworks such as MAE, SimCLR, DINO, and SimSiam. However, it does not specify version numbers for these frameworks, nor for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Our experimental setup for training the encoders mainly follows [47], and we rely on their naming conventions, referring to the data points that are used to train encoder f but not reference encoder g as candidate data points. In total, we use 50000 data points as training samples for CIFAR10, SVHN, and STL10 and 100000 for ImageNet, with 5000 candidate data points per dataset. We set the batch size to 1024 for all our experiments and train for 600 epochs on CIFAR10, SVHN, and STL10, and for 300 epochs on ImageNet. As a distance metric to measure representation alignment, we use the ℓ2 distance. We repeat all experiments with three independent seeds and report average and standard deviation. For reproducibility, we detail our full setup in Table 9 with the standard parameters that are used throughout the paper if not explicitly specified otherwise.
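
The quoted setup lends itself to a short illustration. The following is a minimal PyTorch sketch, not the authors' code: it computes the per-sample ℓ2 distance between the representations of a trained encoder f and a reference encoder g on candidate data points, using the batch size of 1024 and the 5000 candidates per dataset reported above. The encoder definitions, the random candidate tensor, and the function name `l2_alignment` are illustrative placeholders, not artifacts from the paper.

```python
# Minimal sketch (not the authors' implementation): per-sample l2 distance
# between the representations of encoder f and reference encoder g.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def l2_alignment(f: nn.Module, g: nn.Module, loader: DataLoader,
                 device: str = "cuda" if torch.cuda.is_available() else "cpu") -> torch.Tensor:
    """Return one l2 distance per sample between the representations of f and g."""
    f.eval().to(device)
    g.eval().to(device)
    dists = []
    with torch.no_grad():
        for (x,) in loader:
            x = x.to(device)
            # Flatten representations so each sample yields a single scalar distance.
            rf = f(x).flatten(start_dim=1)
            rg = g(x).flatten(start_dim=1)
            dists.append(torch.linalg.vector_norm(rf - rg, ord=2, dim=1).cpu())
    return torch.cat(dists)


if __name__ == "__main__":
    # Toy stand-ins for the SSL encoders; in the paper these are convolutional
    # or transformer-based encoders trained with frameworks such as SimCLR,
    # DINO, MAE, or SimSiam.
    f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
    g = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
    # 5000 candidate data points per dataset, batch size 1024, as reported above.
    candidates = TensorDataset(torch.randn(5000, 3, 32, 32))
    loader = DataLoader(candidates, batch_size=1024)
    print(l2_alignment(f, g, loader).mean())
```

In the paper's setting, f would be trained on the candidate points and g would not, so larger ℓ2 distances on those points indicate weaker alignment between the two encoders; averaging over three seeds, as the quoted setup states, would smooth out run-to-run variation.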