Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Making Large Vision Language Models to Be Good Few-Shot Learners
Authors: Fan Liu, Wenwen Cai, Jian Huo, Chuanyi Zhang, Delong Chen, Jun Zhou
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy has been proven beneficial for training-free LVLMs. |
| Researcher Affiliation | Academia | 1Hohai University, China 2Hong Kong University of Science and Technology, China 3Griffith University, Australia |
| Pseudocode | No | The paper describes steps under '3.3 Attribute Description Generation' as 'The detailed steps are as follows: Step 1: Adaptive Attribute Selection. Step 2: Automatic Prompt Generation. Step 3: Attribute Specific Description Generation. Step 4: Global Attribute Description Generation.' These are descriptive steps and not formatted as pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code https://github.com/HUOUO7/MLVLM-FSL |
| Open Datasets | Yes | For instruction tuning, we selected 13 datasets from ELEVATER (Li et al. 2022). These datasets span various domains such as remote sensing, scene recognition, stripe recognition, and fine-grained classification. For inference, we evaluate our method on eight established FSL datasets: MiniImageNet (MINI) (Vinyals et al. 2016), CIFAR-FS (CIFAR) (Bertinetto et al. 2019), TieredImageNet (TIERED) (Triantafillou et al. 2018), CUB (Wah et al. 2011), Stanford Dogs (Dogs) (Khosla et al. 2011), FGVC-Aircraft (FGVC) (Maji et al. 2013), Oxford 102 Flower (Flowers) (Nilsback and Zisserman 2008), and Stanford Cars (Cars) (Krause et al. 2013). We also test our methods on ISEKAI (Tai et al. 2024), a fully synthetic dataset generated with Midjourney. |
| Dataset Splits | Yes | For these datasets, we randomly split the classes into base and novel sets with a 7:3 ratio, using the base sets for fine-tuning. During testing, the support set S = {(x_i, y_i)}_{i=0}^{N×K} is randomly selected from D_novel, which includes N classes, each containing K samples. The model must then accurately classify the images in the query set Q = {(x_i, y_i)}_{i=0}^{N×M} into one of the N classes present in the support set S, where M is the number of query samples per class. This classification task is generally referred to as an N-way K-shot task. |
| Hardware Specification | No | The paper mentions using the quantized version Qwen-VL-Chat-Int4 and Q-LoRA for fine-tuning but does not specify any hardware details like GPU/CPU models, memory, or specific computing environments used for running the experiments. |
| Software Dependencies | Yes | Additionally, we directly used the frozen SBERT (all-MiniLM-L6-v2) (Reimers and Gurevych 2019) as the text encoder in the semantic-aided inference step to measure the similarity between sentences, which had been trained on a 1B-sentence-pair dataset and could effectively capture the semantic information of sentence vectors. |
| Experiment Setup | Yes | Specifically, the learning process utilized a cosine learning rate scheduler with a base learning rate of 1×10⁻⁵ and a warm-up ratio of 0.01. Optimization was performed using the Adam optimizer, with a weight decay of 0.1 and a β2 parameter set to 0.95, which ensured stability in convergence. The maximum sequence length of the model was set to 2048 tokens to effectively handle long sequences. |
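The N-way K-shot evaluation protocol described under Dataset Splits (sample N novel classes, take K support and M query images per class) can be sketched as follows. This is an illustrative sketch of standard episodic sampling, not code from the paper's repository; all names here are hypothetical.

```python
import random

def sample_episode(dataset, n_way, k_shot, m_query):
    """Sample one N-way K-shot episode.

    dataset: dict mapping class name -> list of samples (novel classes only).
    Returns (support, query) as lists of (sample, episode_label) pairs,
    with len(support) == n_way * k_shot and len(query) == n_way * m_query.
    """
    classes = random.sample(list(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint support and query samples for this class.
        samples = random.sample(dataset[cls], k_shot + m_query)
        support += [(x, label) for x in samples[:k_shot]]
        query += [(x, label) for x in samples[k_shot:]]
    return support, query
```

The query labels are episode-local indices (0..N-1), since the model only needs to match each query image to one of the N support classes.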
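The reported training schedule (cosine learning-rate decay, base LR 1×10⁻⁵, warm-up ratio 0.01) can be sketched as a step-to-LR function. This is a generic cosine-with-linear-warmup schedule using the paper's reported hyperparameters, not the authors' actual training code:

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.01):
    """Cosine LR schedule with linear warm-up.

    Linearly ramps from ~0 to base_lr over the first warmup_ratio of
    training, then decays to 0 along a half cosine.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The remaining reported settings (Adam with weight decay 0.1 and β2 = 0.95, max sequence length 2048) would be passed to the optimizer and tokenizer configuration separately.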