Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Visual Instruction Tuning with Polite Flamingo
Authors: Delong Chen, Jianfeng Liu, Wenliang Dai, Baoyuan Wang
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a comprehensive evaluation comparing the resulting visual instruction-tuned model, which we called Clever Flamingo, with other multi-modal LLMs... Table 1: Performance comparison of with different multi-modal LLMs. ... Ablation Study ... We report the averaged NLI-based validation accuracy of in-domain (held-in) VQA datasets and out-of-distribution (held-out) VQA datasets, |
| Researcher Affiliation | Collaboration | Delong Chen1,2, Jianfeng Liu1, Wenliang Dai2, Baoyuan Wang1 1Xiaobing.AI 2Centre for Artificial Intelligence Research (CAiRE), Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes its methods and training pipeline using textual descriptions and flowcharts (e.g., 'Figure 3: Training pipeline of Polite Flamingo'), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code and dataset are available at https://github.com/ChenDelong1999/politeflamingo |
| Open Datasets | Yes | We utilize a total of 0.77M samples, which include all text-only instructions, LLaVA instructions, and 10% samples (97k) from PF-1M, and trained the model for a single epoch. ... The adopted datasets can be roughly divided into two main groups: captioning datasets, which task the model with providing detailed descriptions of image content, and VQA datasets, which require the model to accurately answer specific queries. We adopted a total of 37 datasets, see the appendix for a detailed summarization. (Citations to public datasets like COCO (Chen et al. 2015), VQA-v2 (Goyal et al. 2017), LLaVA (Liu et al. 2023b), UltraChat (Ding et al. 2023) are provided throughout the paper.) |
| Dataset Splits | Yes | We report the averaged NLI-based validation accuracy of in-domain (held-in) VQA datasets and out-of-distribution (held-out) VQA datasets... We used a reward model to evaluate the politeness of model responses on a total of 52k samples sourced from the validation/test split of a collection of vision-language downstream datasets. |
| Hardware Specification | No | The paper mentions 'Polite Flamingo can be run on consumer GPUs: BF-16 inference roughly takes 18 GB GPU memory.' but does not specify the exact GPU model, CPU, or other detailed hardware specifications used for running the experiments or training. |
| Software Dependencies | No | The paper mentions software components and models like 'OpenFlamingo-9B', 'Guanaco-7B', 'LLaMA-7B', and the 'NLPAUG library' but does not specify exact version numbers for any of these dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | The model is trained with a large context window of 1024 tokens. ... To enhance training efficiency, we use a smaller context window of 196 tokens. ... Stage 3 uses the same setting as Stage 1, but we adjust the learning rate to 10× lower. ... the model is trained for a single epoch. ... when skipping stage 1 and directly going into stage 2 from vanilla OpenFlamingo-9B, the OOD generalization ability further dropped. |
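
The Hardware Specification row quotes a figure of roughly 18 GB of GPU memory for BF-16 inference. As a rough sanity check, the sketch below estimates the memory needed for model weights alone. It assumes about 9 billion parameters (inferred from the "9B" in OpenFlamingo-9B, not stated in the table) and ignores activations, caches, and framework overhead. The function name `weight_memory_gb` is hypothetical and used only for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: roughly 9e9 parameters, inferred from the "9B" model name.
ASSUMED_PARAMS = 9e9
BF16_BYTES = 2  # bfloat16 stores each value in 16 bits
FP32_BYTES = 4  # float32 stores each value in 32 bits

bf16_gb = weight_memory_gb(ASSUMED_PARAMS, BF16_BYTES)
fp32_gb = weight_memory_gb(ASSUMED_PARAMS, FP32_BYTES)

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP32 weights: ~{fp32_gb:.0f} GB")
```

Under these assumptions the BF16 estimate lines up with the quoted figure, which suggests the 18 GB number mostly reflects weight storage; actual usage depends on hardware and runtime details the paper does not specify.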