Visual Instruction Tuning with Polite Flamingo
Authors: Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a comprehensive evaluation comparing the resulting visual instruction-tuned model, which we called Clever Flamingo, with other multi-modal LLMs... Table 1: Performance comparison of [Clever Flamingo] with different multi-modal LLMs. ... Ablation Study ... We report the averaged NLI-based validation accuracy of in-domain (held-in) VQA datasets and out-of-distribution (held-out) VQA datasets... (A sketch of this NLI-based scoring appears below the table.) |
| Researcher Affiliation | Collaboration | Delong Chen (1,2), Jianfeng Liu (1), Wenliang Dai (2), Baoyuan Wang (1); 1: Xiaobing.AI; 2: Centre for Artificial Intelligence Research (CAiRE), Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes its methods and training pipeline using textual descriptions and flowcharts (e.g., 'Figure 3: Training pipeline of Polite Flamingo'), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code and dataset are available at https://github.com/ChenDelong1999/politeflamingo |
| Open Datasets | Yes | We utilize a total of 0.77M samples, which include all text-only instructions, LLaVA instructions, and 10% samples (97k) from PF-1M, and trained the model for a single epoch. ... The adopted datasets can be roughly divided into two main groups: captioning datasets, which task the model with providing detailed descriptions of image content, and VQA datasets, which require the model to accurately answer specific queries. We adopted a total of 37 datasets; see the appendix for a detailed summarization. (Citations to public datasets such as COCO (Chen et al. 2015), VQA-v2 (Goyal et al. 2017), LLaVA (Liu et al. 2023b), and UltraChat (Ding et al. 2023) are provided throughout the paper.) |
| Dataset Splits | Yes | We report the averaged NLI-based validation accuracy of in-domain (held-in) VQA datasets and out-of-distribution (held-out) VQA datasets... We used a reward model to evaluate the politeness of model responses on a total of 52k samples sourced from the validation/test split of a collection of vision-language downstream datasets. (A sketch of such reward-model scoring appears below the table.) |
| Hardware Specification | No | The paper mentions 'Polite Flamingo can be run on consumer GPUs: BF-16 inference roughly takes 18 GB GPU memory.' but does not specify the exact GPU model, CPU, or other hardware details used for running the experiments or training. (A quick memory estimate supporting the 18 GB figure appears below the table.) |
| Software Dependencies | No | The paper mentions software components and models such as 'OpenFlamingo-9B', 'Guanaco-7B', 'LLaMA-7B', and the 'NLPAUG library' but does not specify exact version numbers for any of these dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | The model is trained with a large context window of 1024 tokens. ... To enhance training efficiency, we use a smaller context window of 196 tokens. ... Stage 3 uses the same setting as Stage 1, but we adjust the learning rate to be 10× lower. ... the model is trained for a single epoch. ... when skipping stage 1 and directly going into stage 2 from vanilla OpenFlamingo-9B, the OOD generalization ability further dropped. (The staged schedule is summarized in the config sketch below the table.) |
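
The NLI-based validation accuracy quoted in the Research Type and Dataset Splits rows can be approximated as follows. This is a minimal sketch, not the authors' evaluation code: the exact NLI checkpoint and the premise/hypothesis construction are assumptions (here, an off-the-shelf `roberta-large-mnli` model, counting a response as correct when the reference answer entails it, both conditioned on the question).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "roberta-large-mnli"  # assumption: the paper does not name its NLI model
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()

@torch.no_grad()
def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts that `premise` entails `hypothesis`."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    label_id = model(**inputs).logits.argmax(dim=-1).item()
    return model.config.id2label[label_id] == "ENTAILMENT"

def nli_accuracy(samples):
    """samples: iterable of (question, reference_answer, model_response) triples.
    A response counts as correct when, given the question, the reference
    answer entails it; the per-dataset accuracies are then averaged."""
    hits = [entails(f"{q} {ref}", f"{q} {resp}") for q, ref, resp in samples]
    return sum(hits) / len(hits)
```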
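The reward-model politeness evaluation from the Dataset Splits row could look like the sketch below. The paper does not name the reward model it used, so the OpenAssistant checkpoint here is an assumption; it is a standard sequence-classification reward model that emits a single scalar score per (question, response) pair.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: stand-in reward model; the paper does not identify its checkpoint.
RM = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(RM)
rm_model = AutoModelForSequenceClassification.from_pretrained(RM).eval()

@torch.no_grad()
def politeness_score(question: str, response: str) -> float:
    """Scalar reward for a (question, response) pair; higher means preferred."""
    inputs = rm_tokenizer(question, response, return_tensors="pt", truncation=True)
    return rm_model(**inputs).logits[0, 0].item()
```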
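The ~18 GB BF-16 figure in the Hardware Specification row is consistent with simple parameter-count arithmetic: a 9B-parameter model stored in bfloat16 takes about 2 bytes per parameter for the weights alone, with the remainder going to activations and the KV cache.

```python
# Back-of-the-envelope check of the reported ~18 GB BF-16 inference footprint.
params = 9e9            # OpenFlamingo-9B parameter count (approximate)
bytes_per_param = 2     # bfloat16
weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: {weights_gib:.1f} GiB")  # ~16.8 GiB; rest is activations/cache
```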
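Finally, the staged schedule quoted in the Experiment Setup row can be summarized as a config sketch. This is purely illustrative: the base learning rate and the mapping of context windows to stages are assumptions; the paper states only the two window sizes, the single epoch, and that stage 3 reuses stage 1's setting with a 10× lower learning rate.

```python
# Hypothetical summary of the three-stage schedule; BASE_LR and the
# stage-to-window mapping are assumptions for illustration only.
BASE_LR = 1e-4

STAGES = {
    "stage1": {"context_window": 1024, "lr": BASE_LR},
    "stage2": {"context_window": 196,  "lr": BASE_LR},       # smaller window for efficiency
    "stage3": {"context_window": 1024, "lr": BASE_LR / 10},  # stage 1 setting, 10x lower LR
}
EPOCHS = 1  # the paper trains for a single epoch
```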