Multimodal Few-Shot Learning with Frozen Language Models
Authors: Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are designed to quantify three capacities that should be characteristic of a multimodal few-shot learner: rapid adaptation to new tasks, fast access to general knowledge, and fast binding of visual and linguistic elements. We quantify these capabilities on a range of existing and new benchmarks, paving the way for future analysis of these capabilities. |
| Researcher Affiliation | Collaboration | Maria Tsimpoukelli (DeepMind) mrts@deepmind.com; Jacob Menick (DeepMind, University College London) jmenick@deepmind.com; Serkan Cabi (DeepMind) cabi@deepmind.com; S. M. Ali Eslami (DeepMind) aeslami@deepmind.com; Oriol Vinyals (DeepMind) vinyals@deepmind.com; Felix Hill (DeepMind) felixhill@deepmind.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The Open-Ended miniImageNet, Real-Name miniImageNet, Fast-VQA, and Guided-VQA evaluation sets are available to download at https://fh295.github.io/frozen.html. This link provides evaluation datasets, not the source code for the method. |
| Open Datasets | Yes | We use a 7 billion parameter transformer trained on the public dataset C4 [31]; previous work has shown that the multi-billion-parameter scale is sufficient to exhibit the key capacities we are interested in studying [30, 34]. During training, we update only the parameters φ of the vision encoder using paired image-caption data from the Conceptual Captions dataset [37]. (A minimal sketch of this frozen-LM training setup appears after the table.) |
| Dataset Splits | Yes | We do early stopping on the validation set perplexity, which usually reaches an optimum just after a single epoch with batch size 128. We evaluate on the VQAv2 [10] validation set. (A sketch of the early-stopping criterion appears after the table.) |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or TPU versions) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions software components such as the SentencePiece tokenizer and the Adam optimizer but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | All experiments used the Adam optimizer with β1 = 0.9 and β2 = 0.95 and a constant learning rate of 3e-4 unless otherwise noted. We do early stopping on the validation set perplexity, which usually reaches an optimum just after a single epoch with batch size 128. We experimented with different numbers of tokens k, specifically 1, 2, and 4, and found that 2 performs best, though this would certainly be sensitive to other architectural details. We operate on 224 × 224 images at both train and test time. Images which are not square are first padded with zeroes to square and then resized to 224 × 224. (A preprocessing sketch appears after the table.) |
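
The training recipe quoted in the Open Datasets row (update only the vision-encoder parameters φ while the 7B language model stays frozen) can be illustrated with a minimal PyTorch sketch. The toy modules, their sizes, and the dummy data below are illustrative assumptions, not the paper's released code; only the frozen/trainable split, the k = 2 visual prefix tokens, and the Adam settings (β1 = 0.9, β2 = 0.95, lr 3e-4) come from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, VOCAB, K = 256, 1000, 2   # k = 2 visual prefix tokens, per the paper

class ToyLM(nn.Module):
    """Toy stand-in for the frozen 7B transformer language model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, prefix, token_ids):
        # Prepend the visual prefix to the caption token embeddings.
        x = torch.cat([prefix, self.embed(token_ids)], dim=1)
        h = self.blocks(x)
        return self.head(h)[:, K - 1:]  # positions that predict caption tokens

class ToyVisionEncoder(nn.Module):
    """Toy stand-in for the vision encoder with trainable parameters phi."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, D_MODEL, kernel_size=32, stride=32)
        self.proj = nn.Linear(D_MODEL, K * D_MODEL)

    def forward(self, images):
        feats = self.conv(images).mean(dim=(2, 3))    # pooled image features
        return self.proj(feats).view(-1, K, D_MODEL)  # k prefix embeddings

lm, vision = ToyLM(), ToyVisionEncoder()
for p in lm.parameters():
    p.requires_grad = False  # freeze the LM; gradients still flow through it

# Only phi (the vision encoder) is optimized, with the paper's Adam settings.
opt = torch.optim.Adam(vision.parameters(), lr=3e-4, betas=(0.9, 0.95))

images = torch.randn(4, 3, 224, 224)       # dummy batch of square images
captions = torch.randint(0, VOCAB, (4, 16))

logits = lm(vision(images), captions[:, :-1])  # teacher forcing
loss = F.cross_entropy(logits.reshape(-1, VOCAB), captions.reshape(-1))
loss.backward()
opt.step()
```

The key property this sketch demonstrates is that the backward pass traverses the frozen transformer, so the vision encoder learns to produce prefix embeddings the language model can condition on, without the LM weights ever changing.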
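The early-stopping criterion quoted in the Dataset Splits and Experiment Setup rows can be written out as a short loop. Perplexity here is exp of the mean per-token negative log-likelihood on the validation set; `train_and_evaluate` and the epoch cap are hypothetical placeholders, since the paper does not spell out this loop.

```python
import math

def perplexity(nll_counts: list[tuple[float, int]]) -> float:
    """exp(total NLL / total tokens) over validation (summed-NLL, count) pairs."""
    total_nll = sum(nll for nll, _ in nll_counts)
    total_tokens = sum(n for _, n in nll_counts)
    return math.exp(total_nll / total_tokens)

best_ppl = float("inf")
for epoch in range(10):  # the epoch cap is an assumption
    nll_counts = train_and_evaluate(epoch)  # hypothetical helper: trains one
                                            # epoch, returns validation NLLs
    ppl = perplexity(nll_counts)
    if ppl >= best_ppl:
        break  # stop at the first non-improvement; the paper reports the
               # optimum typically arriving just after a single epoch
    best_ppl = ppl
```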
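The image preprocessing stated in the Experiment Setup row (zero-pad non-square images to square, then resize to 224 × 224) is simple enough to instantiate directly. The padding placement (right/bottom) and bilinear interpolation mode are assumptions, as the paper specifies neither.

```python
import torch
import torch.nn.functional as F

def pad_to_square_and_resize(image: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Zero-pad a (C, H, W) image to square, then resize to (C, size, size)."""
    _, h, w = image.shape
    side = max(h, w)
    # Pad on the right/bottom; the paper only says images are "padded with
    # zeroes to square", so this placement is an assumption.
    padded = F.pad(image, (0, side - w, 0, side - h), value=0.0)
    # Bilinear resampling is likewise an assumption; the interpolation mode
    # is not stated in the paper.
    resized = F.interpolate(padded.unsqueeze(0), size=(size, size),
                            mode="bilinear", align_corners=False)
    return resized.squeeze(0)

x = pad_to_square_and_resize(torch.rand(3, 180, 240))
assert x.shape == (3, 224, 224)
```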