Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Authors: Chen Huang, Skyler Seto, Samira Abnar, David Grangier, Navdeep Jaitly, Joshua Susskind
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our generated Aggregate-and-Adapted Prompt Embedding (or AAPE) proves highly effective on various downstream vision-language tasks. We show AAPE is a new state-of-the-art for few-shot image classification on 11 datasets, under different OOD generalization settings. AAPE can also generalize zero-shot to tasks like image-to-text retrieval, image captioning, and VQA. When finetuned on these tasks, AAPE achieves even better performance than SOTA vision-language models (e.g., MAGMA [11]) whose entire image and text networks are fine-tuned at large cost. |
| Researcher Affiliation | Industry | Chen Huang, Skyler Seto, Samira Abnar, David Grangier, Navdeep Jaitly & Josh Susskind, Apple. {chen-huang,sseto,abnar,grangier,njaitly,jsusskind}@apple.com |
| Pseudocode | No | The paper describes the architecture and steps of the proposed method in text and figures, but it does not include a formal pseudocode block or algorithm listing. |
| Open Source Code | No | NeurIPS Paper Checklist - Question 5: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Section 4 describes the experimental setup, dataset and implementation details to run and reproduce experiments. |
| Open Datasets | Yes | Datasets. We use 11 datasets: ImageNet [10], Caltech101 [12], Oxford Pets [38], Stanford Cars [23], Flowers102 [37], Food101 [4], FGVC-Aircraft [32], SUN397 [54], UCF101 [47], DTD [9] and EuroSAT [14]. These datasets cover a wide range of generic objects and scenes, fine-grained object classes, as well as special domains with textural and satellite images. ... we perform prompt learning on COCO dataset [31] |
| Dataset Splits | No | The paper mentions 'using 1, 2, 4, 8 and 16 shots per class for training (default 16), and the full test set for evaluation' and discusses the 'base-to-new class generalization setting' for training and testing. However, it does not explicitly state details for a separate 'validation' split. (A minimal subsampling sketch of this k-shot protocol follows the table.) |
| Hardware Specification | Yes | Appendix C (Table 7) provides a detailed analysis of the compute cost measured on an Nvidia V100 GPU, where all prompt learners are evaluated for fair efficiency comparisons. |
| Software Dependencies | No | The paper mentions using specific models like 'CLIP vision backbone (ViT-B/16)' and 'GPT-3', but it does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Implementation. We follow the prompt learning details in [67], including the CLIP vision backbone (ViT-B/16), learning rate schedule and the number of epochs for each dataset. ... L = λ·L_distill + L_task, where L_task = −log p(y = c | x) (Eq. 4), and λ = 5 is a weighting parameter. (A hedged code sketch of this objective follows the table.) |
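
For reference, below is a minimal sketch of the training objective quoted in the Experiment Setup row, L = λ·L_distill + L_task with λ = 5. The paper releases no code, so the function name, tensor shapes, and the use of an L2 distillation term between a learned prompt embedding and an aggregated target embedding are all illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def aape_style_loss(logits, labels, learned_prompt_emb, aggregated_prompt_emb, lam=5.0):
    """Combine the classification (task) loss with a prompt-embedding distillation loss.

    logits: (batch, num_classes) image-text similarity scores.
    labels: (batch,) ground-truth class indices.
    learned_prompt_emb / aggregated_prompt_emb: (batch, dim) prompt embeddings;
        the distillation term pulls the learned embedding toward the aggregated one.
    lam: weighting parameter (the paper reports lambda = 5).
    """
    task_loss = F.cross_entropy(logits, labels)      # L_task = -log p(y = c | x)
    distill_loss = F.mse_loss(learned_prompt_emb,    # assumed L2 distillation target;
                              aggregated_prompt_emb)  # the paper's exact L_distill may differ
    return lam * distill_loss + task_loss

# Toy usage with random tensors, just to check shapes and gradient flow.
logits = torch.randn(8, 11, requires_grad=True)      # (batch, num_classes)
labels = torch.randint(0, 11, (8,))
learned = torch.randn(8, 512, requires_grad=True)    # learned prompt embeddings
aggregated = torch.randn(8, 512)                     # aggregated (target) embeddings
loss = aape_style_loss(logits, labels, learned, aggregated)
loss.backward()
```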
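
Similarly, the few-shot protocol quoted in the Dataset Splits row (1, 2, 4, 8, or 16 shots per class for training, full test set for evaluation) amounts to a per-class subsampling of the training data. The dataset representation (a list of (image_path, label) pairs) and the seeding below are assumptions made for illustration; the paper does not specify its split code.

```python
import random
from collections import defaultdict

def sample_k_shot(train_items, k=16, seed=0):
    """Keep at most k training examples per class; evaluation uses the full test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_path, label in train_items:
        by_class[label].append((image_path, label))
    subset = []
    for items in by_class.values():
        rng.shuffle(items)
        subset.extend(items[:k])
    return subset

# Toy usage: 3 classes with 50 images each -> 16-shot training subset.
train_items = [(f"img_{c}_{i}.jpg", c) for c in range(3) for i in range(50)]
few_shot_train = sample_k_shot(train_items, k=16)
assert len(few_shot_train) == 3 * 16
```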