Flamingo: a Visual Language Model for Few-Shot Learning
Authors: Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karén Simonyan
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data. |
| Researcher Affiliation | Industry | Corresponding authors: {jalayrac|jeffdonahue|paulineluc|miech}@deepmind.com |
| Pseudocode | Yes | We provide an illustration, more architectural details, and pseudo-code in Appendix A.1.1. |
| Open Source Code | No | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The code and the data are proprietary. |
| Open Datasets | No | The paper explicitly states in the checklist (item 3a): 'The code and the data are proprietary.' While the paper mentions using and collecting datasets such as ALIGN, M3W, LTIP, and VTP, the overall training mixture is not publicly available, as that proprietary-data statement indicates. |
| Dataset Splits | Yes | For the DEV benchmarks that are used both to validate design decisions and hyperparameters, as well as to report final performance, we therefore use four subsets: validation support, validation query, test support and test query. For other benchmarks, we need only the latter two. We report in Appendix B.1.4 how we form these subsets. |
| Hardware Specification | Yes | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Details can be found in Appendix B.1.2. In short, our largest run was trained on 1536 TPU chips for 15 days. |
| Software Dependencies | No | The paper cites the software it uses, such as JAX [8] and Haiku [40], but it does not provide specific version numbers for these dependencies in the main text or appendix, which would be required for a fully reproducible description. |
| Experiment Setup | Yes | Evaluation hyperparameters and additional details are given in Appendix B.1.5. ... We keep all evaluation hyperparameters fixed across all benchmarks. ... Depending on the task, we use four few-shot prompt templates we describe in more detail in Appendix B.1.5. |
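The "Experiment Setup" row refers to the four few-shot prompt templates described in Appendix B.1.5 of the paper, which are used to prompt a single Flamingo model with task-specific support examples. As a rough illustration of how such interleaved image/text prompts might be assembled, here is a minimal Python sketch; the `IMAGE_TOKEN` and `EOC_TOKEN` markers, the `Shot` / `build_prompt` names, and the template wording are assumptions made for illustration, not the paper's exact templates.

```python
# Minimal sketch of few-shot prompt construction for an interleaved
# image/text interface, in the spirit of Flamingo's evaluation setup.
# Marker tokens and template strings below are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Optional

IMAGE_TOKEN = "<image>"  # placeholder marking where a visual input is interleaved
EOC_TOKEN = "<EOC>"      # end-of-chunk marker separating examples (assumed name)


@dataclass
class Shot:
    image_path: str            # path to the support image
    question: Optional[str]    # None for captioning-style tasks
    answer: str                # target caption or answer


def format_shot(shot: Shot) -> str:
    """Render one support example as an interleaved image/text chunk."""
    if shot.question is None:
        # Captioning-style template (illustrative).
        return f"{IMAGE_TOKEN}Output: {shot.answer}{EOC_TOKEN}"
    # VQA-style template (illustrative).
    return f"{IMAGE_TOKEN}Question: {shot.question} Answer: {shot.answer}{EOC_TOKEN}"


def build_prompt(support: List[Shot], query_question: Optional[str]) -> str:
    """Concatenate the support shots, then leave the query answer blank
    so the model completes it in an open-ended fashion."""
    prompt = "".join(format_shot(s) for s in support)
    if query_question is None:
        prompt += f"{IMAGE_TOKEN}Output:"
    else:
        prompt += f"{IMAGE_TOKEN}Question: {query_question} Answer:"
    return prompt


if __name__ == "__main__":
    shots = [
        Shot("dog.jpg", "What animal is shown?", "a dog"),
        Shot("cat.jpg", "What animal is shown?", "a cat"),
    ]
    print(build_prompt(shots, "What animal is shown?"))
```

In this sketch the query example's answer is deliberately left empty, matching the paper's description of open-ended evaluation, where the model is prompted with support examples and must generate the answer for the final query.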