Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification
Authors: Reza Esfandiarpoor, Stephen Bach
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, FuDD consistently outperforms generic description ensembles and naive LLM-generated descriptions on 12 datasets. |
| Researcher Affiliation | Academia | Reza Esfandiarpoor & Stephen H. Bach, Department of Computer Science, Brown University, Providence, RI 02906, USA, {reza_esfandiarpoor,stephen_bach}@brown.edu |
| Pseudocode | No | The paper describes its method in prose and does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/BatsResearch/fudd |
| Open Datasets | Yes | We evaluate our method on 12 image recognition datasets. We use the CUB-200-2011 (Wah et al., 2011)..., Describable Textures Dataset (DTD) (Cimpoi et al., 2014)..., EuroSAT (Helber et al., 2019)..., FGVC-Aircraft (Maji et al., 2013)..., Flowers102 (Nilsback & Zisserman, 2008), Food101 (Bossard et al., 2014), ImageNet (Deng et al., 2009), ImageNetV2 (Kornblith et al., 2019), Oxford-IIIT Pets (Parkhi et al., 2012), Places365 (Zhou et al., 2017), Stanford Cars (Krause et al., 2013), and Stanford Dogs (Khosla et al., 2011) datasets. |
| Dataset Splits | No | The paper uses well-known datasets for evaluation but does not explicitly describe how the data was split into training, validation, and test sets for its own experiments, nor does it give percentages or counts for such splits. Because the approach is zero-shot and does not train on the target datasets, no train/validation splits are defined for the method itself. |
| Hardware Specification | No | The paper mentions models like GPT-3.5, Llama 2, CLIP, OpenCLIP, and BLIP-2, and vision backbones such as ViT-B/32 and ViT-L/14@336px. However, it does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper specifies using 'gpt-3.5-turbo-0301' as the LLM and 'CLIP (Radford et al., 2021)' as the VLM, and also mentions 'Llama 2 model (Touvron et al., 2023)', 'OpenCLIP (Cherti et al., 2023; Ilharco et al., 2021)', and 'BLIP-2 (Li et al., 2023)'. However, it does not provide specific version numbers for underlying software libraries, programming languages, or development environments such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Setup. We use an instruction-tuned GPT-3 model (Brown et al., 2020; Ouyang et al., 2022), gpt-3.5-turbo-0301, available through the OpenAI API, as our LLM, and CLIP (Radford et al., 2021) as our VLM. ... For an image $x$, we define the set of ambiguous classes $C_A$ as the $k$ most similar classes: $C_A = \arg\max_{\{c_1,\ldots,c_k\} \subset C} \sum_{c_i} \cos(\phi_I(x), h_{c_i})$ ... We use the in-context learning capabilities of LLMs (Brown et al., 2020) to guide the model to focus on visual characteristics by providing two fixed examples as part of the prompt. ... We use the following prompt template: For the following objects, generate captions that represent the distinguishing visual differences between the photos of the two objects. Generate as many captions as you can. Object 1: {class name 1} Object 2: {class name 2} |
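
The quoted setup describes two concrete steps: selecting the $k$ classes whose CLIP text embeddings are most similar to the image embedding, and filling the pairwise prompt template for the LLM. The sketch below illustrates both under stated assumptions: it uses OpenAI's open-source `clip` package, and the function names (`class_embeddings`, `ambiguous_classes`, `differential_prompt`), the prompt plumbing, and the example class list are illustrative choices, not taken from the authors' BatsResearch/fudd code.

```python
# Sketch of FuDD's ambiguous-class selection and pairwise prompt construction.
# Assumes OpenAI's `clip` package (https://github.com/openai/CLIP); names are illustrative.
from itertools import combinations

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def class_embeddings(class_names, template="a photo of a {}."):
    """Encode one text prompt per class and L2-normalize (h_c in the paper's notation)."""
    with torch.no_grad():
        tokens = clip.tokenize([template.format(c) for c in class_names]).to(device)
        h = model.encode_text(tokens)
    return h / h.norm(dim=-1, keepdim=True)


def ambiguous_classes(image_path, class_names, k=10):
    """Return the k classes with the highest cosine similarity to phi_I(x), i.e. the set C_A."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        phi_x = model.encode_image(image)
    phi_x = phi_x / phi_x.norm(dim=-1, keepdim=True)
    sims = (phi_x @ class_embeddings(class_names).T).squeeze(0)  # cosine similarities
    top = sims.topk(k).indices.tolist()
    return [class_names[i] for i in top]


def differential_prompt(class_1, class_2):
    """Fill the paper's prompt template for one pair of ambiguous classes."""
    return (
        "For the following objects, generate captions that represent the "
        "distinguishing visual differences between the photos of the two objects. "
        "Generate as many captions as you can.\n"
        f"Object 1: {class_1}\nObject 2: {class_2}"
    )


# One LLM prompt per unordered pair of ambiguous classes (hypothetical image and labels).
candidates = ambiguous_classes("img.jpg", ["blue jay", "crow", "raven"], k=3)
prompts = [differential_prompt(a, b) for a, b in combinations(candidates, 2)]
```

Building one prompt per pair of ambiguous classes mirrors the paper's core idea: the LLM should describe the visual differences between the specific classes the VLM confuses for a given image, rather than describe each class in isolation.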