Multimodal Few-Shot Learning with Frozen Language Models

Authors: Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are designed to quantify three capacities that should be characteristic of a multimodal few-shot learner: rapid adaptation to new tasks, fast access to general knowledge and fast binding of visual and linguistic elements. We quantify these capabilities on a range of existing and new benchmarks, paving the way for future analysis of these capabilities.
Researcher Affiliation | Collaboration | Maria Tsimpoukelli (DeepMind, mrts@deepmind.com); Jacob Menick (DeepMind and University College London, jmenick@deepmind.com); Serkan Cabi (DeepMind, cabi@deepmind.com); S. M. Ali Eslami (DeepMind, aeslami@deepmind.com); Oriol Vinyals (DeepMind, vinyals@deepmind.com); Felix Hill (DeepMind, felixhill@deepmind.com)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The Open-Ended miniImageNet, Real-Name miniImageNet, Fast-VQA and Guided-VQA evaluation sets are available to download at https://fh295.github.io/frozen.html. This link is for evaluation datasets, not the source code for the methodology.
Open Datasets | Yes | We use a 7 billion parameter transformer trained on the public dataset C4 [31]; previous work has shown that the multi-billion parameter scale is sufficient to exhibit the key capacities we are interested in studying [30, 34]. During training, we update only the parameters φ of the vision encoder using paired image-caption data from the Conceptual Captions dataset [37].
Dataset Splits | Yes | We do early stopping on the validation set perplexity which usually reaches an optimum just after a single epoch with batch size 128. We evaluate on the VQAv2 [10] validation set.
Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or TPU versions) used for running experiments are mentioned in the paper.
Software Dependencies | No | The paper mentions software components like the 'SentencePiece tokenizer' and 'Adam optimizer' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | All experiments used the Adam optimizer with β1 = 0.9 and β2 = 0.95 and a constant learning rate of 3e-4 unless otherwise noted. We do early stopping on the validation set perplexity which usually reaches an optimum just after a single epoch with batch size 128. We experimented using different number of tokens k, specifically 1, 2 and 4 and found that 2 performs best, though certainly this would be sensitive to other architectural details. We operate on 224×224 images at both train and test-time. Images which are not square are first padded with zeroes to square and then resized to 224×224.
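
The training regime quoted in the Open Datasets and Experiment Setup rows (a frozen language model, a trainable vision encoder whose output becomes k = 2 prefix tokens, and Adam with β1 = 0.9, β2 = 0.95 and a constant learning rate of 3e-4) can be illustrated with a minimal PyTorch-style sketch. The stand-in layers, dimensions and module names below are illustrative only, not the paper's 7B-parameter architecture:

```python
import torch
from torch import nn

# Illustrative sizes; the real system uses a 7B-parameter autoregressive LM.
D_LM = 512        # LM embedding width (stand-in value)
K = 2             # number of visual prefix tokens, the value reported to work best
VOCAB = 32000     # stand-in vocabulary size

# Stand-in modules so the sketch runs; not the paper's vision encoder or LM.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, K * D_LM))
lm_embed = nn.Embedding(VOCAB, D_LM)
lm_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_LM, nhead=8, batch_first=True), num_layers=2
)  # bidirectional stand-in; the paper's LM is autoregressive
lm_head = nn.Linear(D_LM, VOCAB)

# Freeze every language-model parameter; only the vision encoder (phi) is updated.
for module in (lm_embed, lm_body, lm_head):
    for p in module.parameters():
        p.requires_grad = False

def forward(images, caption_ids):
    b = images.size(0)
    prefix = vision_encoder(images).view(b, K, D_LM)   # image -> k prefix "tokens"
    text = lm_embed(caption_ids)                        # (B, T, D_LM)
    hidden = lm_body(torch.cat([prefix, text], dim=1))  # prefix conditions the LM
    return lm_head(hidden)                              # logits over the vocabulary

# Adam settings quoted above: beta1 = 0.9, beta2 = 0.95, constant lr 3e-4,
# applied only to the trainable (vision encoder) parameters.
optimizer = torch.optim.Adam(vision_encoder.parameters(), lr=3e-4, betas=(0.9, 0.95))
```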
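The image preprocessing quoted in the Experiment Setup row (zero-pad non-square images to a square, then resize to 224×224) might look like the following sketch; the function name and the use of torchvision are assumptions, not the authors' pipeline:

```python
import torch
import torchvision.transforms.functional as TF

def preprocess(image: torch.Tensor, size: int = 224) -> torch.Tensor:
    """image: (C, H, W) float tensor; returns a (C, size, size) tensor."""
    _, h, w = image.shape
    side = max(h, w)
    # Zero-pad (left, top, right, bottom) so the image becomes square.
    squared = TF.pad(image, [0, 0, side - w, side - h], fill=0)
    return TF.resize(squared, [size, size])

# Example: a 3-channel 180x240 image is padded to 240x240, then resized to 224x224.
dummy = torch.rand(3, 180, 240)
print(preprocess(dummy).shape)  # torch.Size([3, 224, 224])
```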