Visual Instruction Tuning with Polite Flamingo

Authors: Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We perform a comprehensive evaluation comparing the resulting visual instruction-tuned model, which we call Clever Flamingo, with other multi-modal LLMs... Table 1: Performance comparison of Clever Flamingo with different multi-modal LLMs. ... Ablation Study ... We report the averaged NLI-based validation accuracy on in-domain (held-in) VQA datasets and out-of-distribution (held-out) VQA datasets.
Researcher Affiliation Collaboration Delong Chen (1,2), Jianfeng Liu (1), Wenliang Dai (2), Baoyuan Wang (1) — 1: Xiaobing.AI; 2: Centre for Artificial Intelligence Research (CAiRE), Hong Kong University of Science and Technology
Pseudocode No The paper describes its methods and training pipeline using textual descriptions and flowcharts (e.g., 'Figure 3: Training pipeline of Polite Flamingo'), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Code and dataset are available at https://github.com/ChenDelong1999/politeflamingo
Open Datasets Yes We utilize a total of 0.77M samples, which include all text-only instructions, LLaVA instructions, and 10% samples (97k) from PF-1M, and trained the model for a single epoch. ... The adopted datasets can be roughly divided into two main groups: captioning datasets, which task the model with providing detailed descriptions of image content, and VQA datasets, which require the model to accurately answer specific queries. We adopted a total of 37 datasets; see the appendix for a detailed summarization. (Citations to public datasets like COCO (Chen et al. 2015), VQA-v2 (Goyal et al. 2017), LLaVA (Liu et al. 2023b), and UltraChat (Ding et al. 2023) are provided throughout the paper.)
Dataset Splits Yes We report the averaged NLI-based validation accuracy of in-domain (held-in) VQA datasets and out-of-distribution (held-out) VQA datasets... We used a reward model to evaluate the politeness of model responses on a total of 52k samples sourced from the validation/test split of a collection of vision-language downstream datasets.
Hardware Specification No The paper mentions 'Polite Flamingo can be run on consumer GPUs: BF-16 inference roughly takes 18 GB GPU memory.' but does not specify the exact GPU model, CPU, or other detailed hardware specifications used for running the experiments or training.
Software Dependencies No The paper mentions software components and models like 'OpenFlamingo-9B', 'Guanaco-7B', 'LLaMA-7B', and the 'NLPAUG library' but does not specify exact version numbers for any of these dependencies, which is required for reproducibility.
Experiment Setup Yes The model is trained with a large context window of 1024 tokens. ... To enhance training efficiency, we use a smaller context window of 196 tokens. ... Stage 3 uses the same setting as Stage 1, but we adjust the learning rate to be 10× lower. ... the model is trained for a single epoch. ... when skipping stage 1 and directly going into stage 2 from vanilla OpenFlamingo-9B, the OOD generalization ability further dropped.
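The staged schedule quoted in the Experiment Setup row can be summarized as a minimal configuration sketch. Note the hedging: the base learning rate, the dict layout, and the stage-to-window assignment below are illustrative assumptions, not values from the paper — the paper states only the two context-window sizes (1024 and 196 tokens), the 10× learning-rate reduction in stage 3 relative to stage 1, and single-epoch training.

```python
# Hypothetical sketch of the three-stage training schedule described above.
# BASE_LR is an assumed placeholder; the paper does not report the actual value.
BASE_LR = 1e-4  # assumption, not from the paper

stages = {
    # Assumed reading: the large 1024-token window applies to stage 1.
    "stage1": {"context_window": 1024, "lr": BASE_LR, "epochs": 1},
    # Smaller 196-token window "to enhance training efficiency" (assumed stage 2).
    "stage2": {"context_window": 196, "lr": BASE_LR, "epochs": 1},
    # "Same setting as Stage 1, but ... learning rate 10x lower."
    "stage3": {"context_window": 1024, "lr": BASE_LR / 10, "epochs": 1},
}

for name, cfg in stages.items():
    print(f"{name}: window={cfg['context_window']}, lr={cfg['lr']:g}, epochs={cfg['epochs']}")
```

Only the relations encoded here are grounded in the quoted text; any reproduction attempt would still need the missing base learning rate, batch size, and hardware details that the review flags as unreported.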