Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP

Authors: Chen Huang, Skyler Seto, Samira Abnar, David Grangier, Navdeep Jaitly, Joshua Susskind

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our generated Aggregate-and-Adapted Prompt Embedding (AAPE) proves highly effective on various downstream vision-language tasks. We show AAPE sets a new state-of-the-art for few-shot image classification on 11 datasets under different OOD generalization settings. AAPE also generalizes zero-shot to tasks like image-to-text retrieval, image captioning and VQA. When finetuned on these tasks, AAPE achieves even better performance than SOTA vision-language models (e.g., MAGMA [11]) whose entire image and text networks are fine-tuned at large cost.
Researcher Affiliation | Industry | Chen Huang, Skyler Seto, Samira Abnar, David Grangier, Navdeep Jaitly & Josh Susskind, Apple; {chen-huang,sseto,abnar,grangier,njaitly,jsusskind}@apple.com
Pseudocode | No | The paper describes the architecture and steps of the proposed method in text and figures, but it does not include a formal pseudocode block or algorithm listing.
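Since no algorithm listing is provided, the pipeline described in the paper's text and figures (aggregate LLM-generated prompt embeddings, then train a generator to adapt them to the input image) can be sketched roughly as below. This is an illustrative reconstruction under stated assumptions; the aggregation rule, module names, and layer sizes are not from the paper.

```python
# Illustrative sketch of the aggregate-and-adapt steps as described in the paper's
# text and figures; names, the aggregation rule, and layer sizes are assumptions,
# not the authors' implementation.
import torch
import torch.nn.functional as F

def aggregate_prompts(llm_prompt_embeds, image_embed):
    """Aggregate CLIP text embeddings of several LLM-generated prompts into a
    single target embedding, weighting each prompt by its similarity to the image."""
    weights = F.softmax(llm_prompt_embeds @ image_embed, dim=-1)  # (num_prompts,)
    return weights @ llm_prompt_embeds                            # (embed_dim,)

class PromptGenerator(torch.nn.Module):
    """Maps the CLIP image embedding to an Aggregate-and-Adapted Prompt Embedding
    (AAPE), so no LLM queries are needed at test time."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, dim),
            torch.nn.ReLU(),
            torch.nn.Linear(dim, dim),
        )

    def forward(self, image_embed):
        return self.net(image_embed)
```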
Open Source Code | No | NeurIPS Paper Checklist, Question 5: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Section 4 describes the experimental setup, dataset and implementation details to run and reproduce experiments.
Open Datasets | Yes | Datasets. We use 11 datasets: ImageNet [10], Caltech101 [12], Oxford Pets [38], Stanford Cars [23], Flowers102 [37], Food101 [4], FGVC-Aircraft [32], SUN397 [54], UCF101 [47], DTD [9] and EuroSAT [14]. These datasets cover a wide range of generic objects and scenes, fine-grained object classes, as well as special domains with textural and satellite images. ... we perform prompt learning on COCO dataset [31]
Dataset Splits | No | The paper mentions 'using 1, 2, 4, 8 and 16 shots per class for training (default 16), and the full testset for evaluation' and discusses the 'base-to-new class generalization setting' for training and testing. However, it does not explicitly state details for a separate 'validation' split.
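The few-shot protocol quoted above (k in {1, 2, 4, 8, 16} shots per class for training, full test set for evaluation) could be reproduced with a simple per-class sampler such as the sketch below; since the paper states no validation split, none is created here. The function and variable names are illustrative assumptions.

```python
import random
from collections import defaultdict

def sample_k_shot(train_samples, k=16, seed=0):
    """Sample k training examples per class; evaluation uses the full test set.
    `train_samples` is assumed to be a list of (image_path, class_id) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, cls in train_samples:
        by_class[cls].append((path, cls))
    few_shot = []
    for cls, items in by_class.items():
        few_shot.extend(rng.sample(items, min(k, len(items))))
    return few_shot
```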
Hardware Specification | Yes | Appendix C (Table 7) provides a detailed analysis of the compute cost measured on Nvidia V100 GPU, where all prompt learners are evaluated for fair efficiency comparisons.
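As a point of reference, per-model inference cost on a single GPU (as reported in the paper's Table 7) can be estimated with standard CUDA event timing. The snippet below is a generic measurement sketch, not the authors' benchmarking code.

```python
import torch

@torch.no_grad()
def measure_latency_ms(model, example_input, warmup=10, iters=100):
    """Average forward-pass latency in milliseconds on the current CUDA device."""
    model.eval().cuda()
    example_input = example_input.cuda()
    for _ in range(warmup):
        model(example_input)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(example_input)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```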
Software Dependencies | No | The paper mentions using specific models like 'CLIP vision backbone (ViT-B/16)' and 'GPT-3', but it does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | Implementation. We follow the prompt learning details in [67], including the CLIP vision backbone (ViT-B/16), learning rate schedule and the number of epochs for each dataset. ... L = λ·L_distill + L_task, where L_task = -log p(y = c | x) (Eq. 4), and λ = 5 is a weighting parameter.
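A minimal sketch of the quoted training objective, assuming a mean-squared-error form for L_distill and externally computed classification logits for the task term (neither detail is fixed by the excerpt above):

```python
import torch.nn.functional as F

def total_loss(aape, aggregated_target, logits, label, lam=5.0):
    """L = lam * L_distill + L_task with lam = 5, as quoted from the paper.
    The MSE distillation form and the provided logits are assumptions; the
    excerpt only fixes the weighted sum and L_task = -log p(y = c | x)."""
    distill = F.mse_loss(aape, aggregated_target)  # pull generated AAPE toward the aggregate
    task = F.cross_entropy(logits, label)          # -log p(y = c | x) for the true class c
    return lam * distill + task
```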