Measuring Déjà Vu Memorization Efficiently

Authors: Narine Kokhlikyan, Bargav Jayaraman, Florian Bordes, Chuan Guo, Kamalika Chaudhuri

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our results show that different ways of measuring memorization yield very similar aggregate results. We also find that open-source models typically have lower aggregate memorization than similar models trained on a subset of the data."
Researcher Affiliation | Industry | Narine Kokhlikyan (FAIR at Meta), Bargav Jayaraman (FAIR at Meta), Florian Bordes (FAIR at Meta), Chuan Guo (FAIR at Meta), Kamalika Chaudhuri (FAIR at Meta)
Pseudocode | No | "The paper does not contain any blocks explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format."
Open Source Code | Yes | "The code is available for both vision and vision-language models."
Open Datasets | Yes | "We conduct all our image representation learning experiments on the ImageNet dataset [Deng et al., 2009]."
Dataset Splits | Yes | "We use 300k examples (300 per class) to train the reference models to learn dataset-level correlations. We measure memorization accuracy on an additional disjoint set of 300k images. For the two-model tests, these images are included in the training set of the target models, but not the reference models. Finally, we use another distinct set of 500k images to predict the nearest foreground object given the representation of a background crop through KNN."
Hardware Specification | Yes | "The reference models are trained on a single machine with 8 NVIDIA V100 GPUs (32 GB per GPU) using a batch size of 128."
Software Dependencies | No | "We train CLIP models using the OpenCLIP framework [Ilharco et al., 2021]."
Experiment Setup | Yes | "We train our models for 200 epochs with a learning rate of 0.0005 and a warmup of 2000 steps for the cosine learning rate scheduler. Our training runs use 512 GB of RAM and 32 NVIDIA A100 GPUs with a global batch size of 32,768."
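The Dataset Splits row mentions predicting the nearest foreground object from a background-crop representation via KNN. The following is a minimal sketch of such a KNN label predictor; the function name, the use of cosine similarity, the choice of k, and the majority-vote rule are all illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def knn_predict_labels(public_emb, public_labels, query_emb, k=5):
    """Predict a label for each query embedding by majority vote over its
    k nearest neighbours (cosine similarity) in a labeled public set.

    This is a generic sketch of a KNN probe on learned representations;
    parameter names and similarity metric are assumptions.
    """
    # Normalise rows so the dot product equals cosine similarity.
    pub = public_emb / np.linalg.norm(public_emb, axis=1, keepdims=True)
    qry = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = qry @ pub.T                      # (n_query, n_public)
    nn = np.argsort(-sims, axis=1)[:, :k]   # indices of k nearest neighbours
    votes = public_labels[nn]               # (n_query, k) neighbour labels
    # Majority vote per query.
    return np.array([np.bincount(v).argmax() for v in votes])
```

In a memorization test of this kind, above-chance accuracy on crops that only the target model saw during training is the signal of interest; here the probe itself is all that is sketched.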
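The Experiment Setup row describes a cosine learning-rate schedule with a 2000-step warmup and a peak learning rate of 0.0005. A minimal sketch of that schedule is below, assuming linear warmup to the peak and cosine decay to zero; the total step count and the zero decay floor are illustrative assumptions not stated in the row.

```python
import math

def lr_at_step(step, peak_lr=0.0005, warmup_steps=2000, total_steps=200_000):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to 0.

    peak_lr and warmup_steps come from the reported setup; total_steps
    and the zero floor are assumptions for illustration.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(0))        # start of warmup: 0.0
print(lr_at_step(2000))     # end of warmup: peak, 0.0005
print(lr_at_step(200_000))  # end of training: ~0.0
```

In practice this is usually expressed through a framework scheduler (e.g. OpenCLIP's built-in cosine schedule) rather than hand-rolled, but the closed form makes the warmup/decay shape explicit.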