Vocabulary-free Image Classification
Authors: Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with much fewer parameters, paving the way for future research in this direction. We experiment on several datasets, considering both coarse (e.g. Caltech-101 [14], UCF101 [55]) and fine-grained (e.g. FGVC-Aircraft [40], Flowers-102 [43]) classification tasks. |
| Researcher Affiliation | Collaboration | Alessandro Conti¹, Enrico Fini¹, Massimiliano Mancini¹, Paolo Rota¹, Yiming Wang², Elisa Ricci¹,²; ¹University of Trento, ²Fondazione Bruno Kessler (FBK) |
| Pseudocode | No | No. The paper describes the method steps in narrative form and with equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and demo are available at https://github.com/altndrr/vic |
| Open Datasets | Yes | Datasets. We follow existing works [53, 66] and use ten datasets that feature both coarse-grained and fine-grained classification in different domains: Caltech-101 (C101) [14], DTD [7], EuroSAT (ESAT) [21], FGVC-Aircraft (Airc.) [40], Flowers-102 (Flwr) [43], Food-101 (Food) [4], Oxford Pets (Pets), Stanford Cars (Cars) [29], SUN397 (SUN) [61], and UCF101 (UCF) [55]. Additionally, we used ImageNet [10] for hyperparameter tuning. As database, we use a subset of PMD [54], containing five of its largest datasets: Conceptual Captions (CC3M) [52], Conceptual Captions 12M (CC12M) [5], Wikipedia Image Text (WIT) [56], RedCaps [12], and a subset of YFCC100M [57] used for PMD (YFCC100M*). |
| Dataset Splits | No | No. The paper mentions using ImageNet for hyperparameter tuning but does not provide specific training/validation/test split percentages, sample counts, or citations to predefined splits for the main datasets (Caltech-101, DTD, etc.) used for evaluation. |
| Hardware Specification | Yes | Implementation details. Our experiments were conducted using NVIDIA A6000 GPUs with mixed-bit precision. (A hypothetical mixed-precision inference sketch is included after this table.) |
| Software Dependencies | No | No. The paper mentions the use of CLIP and the NLP library flair (https://github.com/flairNLP/flair) but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We tuned the α hyperparameter of Eq. (6) and the number of retrieved captions K of our method on the ImageNet dataset, finding that α = 0.7 and K = 10 led to the best results. We use these values across all experiments. (A hypothetical scoring sketch using these values follows this table.) |
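
The Experiment Setup row references the α of Eq. (6) and K = 10 retrieved captions without reproducing the equation. The snippet below is a minimal sketch, assuming the candidate score is a linear blend of image-to-candidate and caption-to-candidate similarities weighted by α; random tensors stand in for CLIP embeddings, and the function name `score_candidates` is illustrative rather than taken from the released code.

```python
# Hypothetical sketch of an alpha-weighted candidate score in the spirit of Eq. (6):
# blend image-to-candidate similarity with candidate-to-caption-centroid similarity.
# Random tensors stand in for CLIP embeddings; see the paper for the exact formulation.
import torch
import torch.nn.functional as F


def score_candidates(image_emb, candidate_embs, caption_embs, alpha=0.7):
    """Return one score per candidate name; alpha = 0.7 is the value tuned on ImageNet."""
    image_emb = F.normalize(image_emb, dim=-1)
    candidate_embs = F.normalize(candidate_embs, dim=-1)
    caption_centroid = F.normalize(caption_embs.mean(dim=0), dim=-1)

    visual = candidate_embs @ image_emb           # image-to-candidate cosine similarity
    textual = candidate_embs @ caption_centroid   # candidate-to-retrieved-captions similarity
    return alpha * visual + (1 - alpha) * textual


# Toy usage: K = 10 retrieved captions, 5 candidate names, 512-d embeddings.
K, num_candidates, dim = 10, 5, 512
scores = score_candidates(torch.randn(dim), torch.randn(num_candidates, dim), torch.randn(K, dim))
print(int(scores.argmax()))  # index of the highest-scoring candidate name
```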
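
The Hardware Specification row mentions NVIDIA A6000 GPUs with mixed-bit precision but gives no configuration details. Below is a minimal sketch of mixed-precision inference using PyTorch's `autocast`; the stand-in linear layer is not the paper's model, and the actual CaSED code may configure precision differently.

```python
# Hypothetical mixed-precision inference sketch (PyTorch autocast); the linear layer
# is a stand-in for a CLIP encoder and is not taken from the paper's implementation.
import torch

model = torch.nn.Linear(512, 512).cuda()            # requires a CUDA-capable GPU
images = torch.randn(8, 512, device="cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    features = model(images)

print(features.dtype)  # torch.float16 inside the autocast region
```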