Democratizing Fine-grained Visual Recognition with Large Language Models

Authors: Mingxuan Liu, Subhankar Roy, Wenjing Li, Zhun Zhong, Nicu Sebe, Elisa Ricci

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have conducted experiments on several fine-grained datasets that include Caltech-UCSD Bird-200 (Wah et al., 2011), Stanford Car-196 (Krause et al., 2013), Stanford Dog-120 (Khosla et al., 2011), Flower-102 (Nilsback & Zisserman, 2008), and Oxford-IIIT Pet-37 (Parkhi et al., 2012). [...] As shown in Tab. 1, our FineR system outperforms the second-best method (BLIP-2) by a substantial margin, giving improvements of +9.8% in cACC and +5.7% in sACC averaged on the five datasets.
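For readers verifying the metrics: cACC (clustering accuracy) is conventionally computed by matching predicted clusters to ground-truth classes with the Hungarian algorithm. Below is a minimal sketch of that standard formulation, assuming numpy and scipy; it is not the authors' released evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """cACC: accuracy under the optimal one-to-one cluster-to-class mapping."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = int(max(y_true.max(), y_pred.max())) + 1
    # Contingency table: rows = predicted clusters, cols = true classes.
    count = np.zeros((n, n), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        count[p, t] += 1
    # Hungarian matching maximizes correctly assigned samples
    # (negated because linear_sum_assignment minimizes cost).
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / y_true.size

print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 2]))  # 0.75
```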
Researcher Affiliation | Academia | Mingxuan Liu1, Subhankar Roy4, Wenjing Li3,6, Zhun Zhong3,5, Nicu Sebe1, Elisa Ricci1,2. 1University of Trento, Trento, Italy; 2Fondazione Bruno Kessler, Trento, Italy; 3Hefei University of Technology, Hefei, China; 4University of Aberdeen, Aberdeen, UK; 5University of Nottingham, Nottingham, UK; 6University of Leeds, Leeds, UK
Pseudocode | No | The paper describes its pipeline with detailed text and figures (e.g., Figure 3), but it does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code is available at https://projfiner.github.io. [...] Upon publication, we will release all essential resources for reproducing the main experimental results in the main paper, including code, prompts, datasets and the data splits.
Open Datasets | Yes | We have conducted experiments on several fine-grained datasets that include Caltech-UCSD Bird-200 (Wah et al., 2011), Stanford Car-196 (Krause et al., 2013), Stanford Dog-120 (Khosla et al., 2011), Flower-102 (Nilsback & Zisserman, 2008), and Oxford-IIIT Pet-37 (Parkhi et al., 2012). [...] We plan to publicly release this Pokemon dataset along with our code.
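Two of the five benchmarks ship with built-in torchvision loaders, which simplifies reproduction; a hedged sketch follows (the authors' own data pipeline and splits may differ).

```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms

transform = transforms.Compose([transforms.Resize((224, 224)),
                                transforms.ToTensor()])

# Flower-102 and Oxford-IIIT Pet-37 download directly via torchvision;
# CUB-200, Stanford Car-196, and Stanford Dog-120 generally require
# manual downloads from their respective project pages.
flowers = datasets.Flowers102(root="data", split="test",
                              download=True, transform=transform)
pets = datasets.OxfordIIITPet(root="data", split="test",
                              download=True, transform=transform)

image, label = flowers[0]
print(image.shape, label)  # torch.Size([3, 224, 224]), class index
```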
Dataset Splits | No | The paper mentions a 'training split' and a 'test split' but never defines a validation split: the main text gives no percentages, no sample counts, and no methodology for its use (e.g., for hyperparameter tuning). The appendix discusses hyperparameter sensitivity analyses, but the validation data used for them is not specified.
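Because no validation protocol is given, anyone reproducing the sensitivity analyses must improvise one. Below is a minimal sketch of a seeded 90/10 split; the ratio and seed are purely reproduction assumptions, not the paper's procedure.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset; in practice this would be one of the paper's
# training splits. The 90/10 ratio and the seed are assumptions,
# since the paper specifies no validation methodology.
train_set = TensorDataset(torch.randn(1000, 3, 224, 224),
                          torch.randint(0, 200, (1000,)))
n_val = len(train_set) // 10
train_subset, val_subset = random_split(
    train_set,
    [len(train_set) - n_val, n_val],
    generator=torch.Generator().manual_seed(42),  # fixed seed for reproducibility
)
print(len(train_subset), len(val_subset))  # 900 100
```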
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for experiments, such as GPU models, CPU types, or memory specifications. It names the models used (BLIP-2 Flan-T5xxl, ChatGPT gpt-3.5-turbo, CLIP ViT-B/16) but not the machines they ran on.
Software Dependencies | No | The paper lists the main models used: BLIP-2 (Li et al., 2023) Flan-T5xxl, ChatGPT (OpenAI, 2022) gpt-3.5-turbo, and CLIP (Radford et al., 2021) ViT-B/16, and mentions generalizability with Vicuna-13B (Chiang et al., 2023). However, it does not specify version numbers for general software dependencies such as Python, PyTorch, TensorFlow, or CUDA.
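Without pinned versions, reproducers must guess a software stack. A hedged sketch of loading the named checkpoints through Hugging Face transformers follows; the checkpoint ids are the public ones, but the authors may have used different wrappers or versions.

```python
# Assumed stack (not specified in the paper): Python 3.10,
# torch >= 2.0, transformers >= 4.30.
from transformers import (
    Blip2ForConditionalGeneration,
    Blip2Processor,
    CLIPModel,
    CLIPProcessor,
)

# BLIP-2 Flan-T5xxl (~12B parameters; needs a large GPU or CPU offloading).
blip2 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")
blip2_processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")

# CLIP ViT-B/16.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# gpt-3.5-turbo is accessed through the OpenAI API rather than loaded locally.
```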
Experiment Setup | Yes | The multi-modal fusion hyperparameter and exemplar augmentation times are set to α = 0.7 and K = 10. [...] We elaborate prompt design details in App. B.2.
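To make the role of α concrete: below is a generic score-level fusion that blends text-derived and image-derived class similarities with weight α = 0.7. All tensor and function names are hypothetical illustrations, and whether α weights the text or the image term here is an assumption, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

ALPHA = 0.7  # multi-modal fusion weight reported in the paper

def fused_logits(image_feats, text_protos, image_protos, alpha=ALPHA):
    """Blend cosine similarities to text and image class prototypes.

    image_feats:  (N, D) embeddings of test images
    text_protos:  (C, D) per-class text embeddings (e.g., candidate names)
    image_protos: (C, D) per-class image embeddings (e.g., averaged exemplars)
    """
    image_feats = F.normalize(image_feats, dim=-1)
    sim_text = image_feats @ F.normalize(text_protos, dim=-1).T
    sim_image = image_feats @ F.normalize(image_protos, dim=-1).T
    return alpha * sim_text + (1 - alpha) * sim_image

logits = fused_logits(torch.randn(8, 512), torch.randn(10, 512), torch.randn(10, 512))
print(logits.argmax(dim=-1))  # predicted class index per image
```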