Déjà Vu Memorization in Vision–Language Models

Authors: Bargav Jayaraman, Chuan Guo, Kamalika Chaudhuri

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate déjà vu memorization at both sample and population level, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs."
Researcher Affiliation | Industry | Bargav Jayaraman (FAIR, Meta, California, USA; bargav@meta.com), Chuan Guo (FAIR, Meta, California, USA; chuanguo@meta.com), Kamalika Chaudhuri (FAIR, Meta, California, USA; kamalika@meta.com)
Pseudocode | Yes | "Algorithm 1 k-Nearest Neighbor Test" (a hedged sketch of this test appears after the table)
Open Source Code | Yes | "The code is available here: https://github.com/facebookresearch/VLMDejaVu."
Open Datasets | Yes | "We use ImageNet [Yang et al., 2022] for which the license can be found at https://www.image-net.org/download.php. We use a filtered version of LAION [Radenovic et al., 2023] (which we call filtered LAION) for which licensing information can be found at https://github.com/facebookresearch/diht/blob/main/LICENSE. The licensing information for the MS COCO data set [Lin et al., 2014] that we use can be found at https://cocodataset.org/#termsofuse. We also use the Shutterstock data set, which is a private licensed data set consisting of 239M image-caption pairs."
Dataset Splits | No | "A small portion of the remaining 3M data is used as a hold-out set for hyper-parameter tuning during model training."
Hardware Specification | Yes | "For filtered LAION experiments, we use 256 Nvidia Quadro GP100 GPUs with 16GB VRAM to train the models in parallel with an effective batch size of 16,384. ... For Shutterstock experiments, we use 32 Nvidia A100 GPUs with 80GB VRAM to train the models in parallel with an effective batch size of 32,768. ... All the model training runs use 512GB RAM..." (the implied per-GPU batch sizes are worked out after the table)
Software Dependencies | No | "For our experiments we use OpenCLIP [Ilharco et al., 2021] to train the models. ... We use Adam [Kingma and Ba, 2017] optimizer... For object annotations, we use Detic [Zhou et al., 2022]" (a minimal OpenCLIP instantiation sketch appears after the table)
Experiment Setup | Yes | "We use the ViT-B-32 CLIP model architecture consisting of around 151M trainable parameters and train the models for 200 epochs using Adam [Kingma and Ba, 2017] optimizer with cosine learning rate scheduler and a learning rate of 0.0005. ... We set the weight decay to 0.1 and use 1000 warmup steps for the learning rate scheduler. ... We set the weight decay to 0.2 and warmup to 2000 steps." (a hedged configuration sketch appears after the table)
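
For orientation, here is a minimal sketch of what a k-nearest-neighbor memorization test of this kind can look like. All names and the data layout are illustrative assumptions, not the paper's exact implementation: a training caption is embedded with the target CLIP text encoder, the k public images nearest to it in the shared embedding space are retrieved, and the neighbors' object annotations are aggregated as a prediction of the target image's contents.

```python
# Hypothetical sketch of a k-nearest-neighbor test in the spirit of
# Algorithm 1. Function and variable names are assumptions for illustration.
import numpy as np

def knn_object_prediction(caption_emb, public_image_embs, public_objects, k=10):
    """Predict the objects in a target image from its caption embedding.

    caption_emb:       (d,) caption embedding from the model's text encoder
    public_image_embs: (n, d) image embeddings of a public image set
    public_objects:    list of n sets of object annotations (e.g., from Detic)
    """
    # Cosine similarity between the caption and every public image.
    sims = public_image_embs @ caption_emb
    sims /= np.linalg.norm(public_image_embs, axis=1) * np.linalg.norm(caption_emb)
    neighbors = np.argsort(-sims)[:k]
    # Vote over the neighbors' object annotations.
    votes = {}
    for i in neighbors:
        for obj in public_objects[i]:
            votes[obj] = votes.get(obj, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)
```

In the paper's setup, memorization is assessed by running such a test with both a target model (trained on the sample) and a reference model (trained without it) and comparing how well each recovers the ground-truth objects.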
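The per-GPU batch sizes implied by the hardware row follow from simple division, assuming the effective batch size is split evenly across data-parallel workers (an assumption; the paper does not state the per-GPU split):

```python
# Implied per-GPU batch sizes, assuming an even data-parallel split.
laion_per_gpu = 16_384 // 256        # 64 pairs per Quadro GP100 (16GB)
shutterstock_per_gpu = 32_768 // 32  # 1,024 pairs per A100 (80GB)
print(laion_per_gpu, shutterstock_per_gpu)  # 64 1024
```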
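Since the dependencies row names OpenCLIP without versions, here is a minimal sketch of instantiating the same architecture with the open_clip package; `pretrained=None` yields randomly initialized weights as needed for from-scratch training, and the image and caption are placeholders:

```python
# Minimal sketch: instantiate the ViT-B-32 CLIP architecture with OpenCLIP.
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained=None)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0)  # placeholder image
text = tokenizer(["a photo of a dog"])                         # placeholder caption
with torch.no_grad():
    image_features = model.encode_image(image)  # shape (1, 512)
    text_features = model.encode_text(text)     # shape (1, 512)
```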
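And a hedged sketch of the stated optimization recipe (Adam, learning rate 5e-4, cosine schedule with linear warmup; weight decay 0.1 with 1000 warmup steps for filtered LAION, 0.2 with 2000 steps for Shutterstock). The stand-in model and total step count are placeholders, and PyTorch's Adam applies weight decay as L2 regularization, which may differ from the authors' exact setup:

```python
# Sketch of the optimizer/scheduler configuration described above.
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the ~151M-parameter ViT-B-32 CLIP
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=0.1)

def warmup_cosine(step, total_steps=200_000, warmup_steps=1_000):
    # Linear warmup to the base LR, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
# Each training step: optimizer.step() followed by scheduler.step().
```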