CLE-ViT: Contrastive Learning Encoded Transformer for Ultra-Fine-Grained Visual Categorization

Authors: Xiaohan Yu, Jun Wang, Yongsheng Gao

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CLE-ViT demonstrates strong performance on seven publicly available datasets, confirming its effectiveness on the ultra-FGVC task.
Researcher Affiliation | Academia | School of Engineering and Built Environment, Griffith University, Australia; Department of Computer Science, University of Warwick, UK. {xiaohan.yu, yongsheng.gao}@griffith.edu.au, jun.wang.3@warwick.ac.uk
Pseudocode | No | The paper describes the method textually and with diagrams (e.g., Figure 4), but it does not include any formal pseudocode or algorithm blocks; a hedged illustrative sketch is given after the table.
Open Source Code | Yes | The code is available at https://github.com/Markin-Wang/CLEViT
Open Datasets | Yes | Following [Yu et al., 2023], five ultra-fine-grained image datasets are adopted for evaluation: Cotton80, SoyLocal, SoyGene, SoyAgeing, and SoyGlobal. Moreover, two fine-grained datasets, the Apple Foliar Disease dataset [Thapa et al., 2020] and CUB-200-2011 (CUB) [Wah et al., 2011], are also used to further verify the effectiveness of the proposed method.
Dataset Splits | No | Table 1 provides the number of training and test images for each dataset, but no separate validation split (or its size/proportion) is mentioned.
Hardware Specification | No | The paper states that a Swin Transformer Base backbone is used, but gives no hardware details (GPU model, CPU type, or memory) for the experiments.
Software Dependencies | No | The paper mentions the AdamW optimizer and standard data augmentation techniques, but it does not list software dependencies with version numbers (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | The proportion of the masked region and the number of parts n are set to [0.15, 0.45] and 4, respectively. The margin β in Equation 7 is 1. λ and γ are both set to 1 for all datasets, except 0.3 and 0.5 for the CUB dataset. ... input images are first resized to 600 × 600 for all datasets. Random (Center) cropping is then applied to crop the images into 448 × 448 during the training (inference) phase. After that, we adopt random horizontal flipping, color jitter, and random rotation during the training. The whole architecture is optimized by the AdamW optimizer. In our experiment settings, the batch size and the learning rate are set to 12 and 1e-3 for all the datasets. (A hedged code reconstruction of this setup follows below.)
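Since the paper provides no pseudocode, the following is only a rough PyTorch sketch of the kind of mask-and-shuffle positive-view generation and margin-based contrastive (triplet) loss that the quoted hyperparameters suggest: a mask ratio drawn from [0.15, 0.45], n = 4 parts, and margin β = 1 in Equation 7. All function names and the exact masking/shuffling scheme here are assumptions, not the authors' implementation; consult the linked repository for the real code.

```python
import random
import torch
import torch.nn.functional as F

def mask_and_shuffle(img, mask_range=(0.15, 0.45), n_parts=4):
    """Hypothetical positive-view generator: mask a random region of the
    image, then split it into an n_parts x n_parts grid and permute the
    tiles. The exact scheme in CLE-ViT may differ; this is a sketch."""
    c, h, w = img.shape
    out = img.clone()

    # Zero out a random rectangle covering a ratio drawn from mask_range.
    ratio = random.uniform(*mask_range)
    mh, mw = int(h * ratio ** 0.5), int(w * ratio ** 0.5)
    top, left = random.randint(0, h - mh), random.randint(0, w - mw)
    out[:, top:top + mh, left:left + mw] = 0.0

    # Split into an n_parts x n_parts grid and shuffle the tiles.
    ph, pw = h // n_parts, w // n_parts
    tiles = [out[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
             for i in range(n_parts) for j in range(n_parts)]
    random.shuffle(tiles)
    rows = [torch.cat(tiles[k * n_parts:(k + 1) * n_parts], dim=2)
            for k in range(n_parts)]
    return torch.cat(rows, dim=1)

def triplet_loss(anchor, positive, negative, beta=1.0):
    """Standard triplet margin loss; beta = 1 matches the paper's margin."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + beta).mean()
```

Presumably the anchor/positive pair would be the embeddings of an image and its masked-and-shuffled view, with negatives drawn from other images in the batch, but that pairing is inferred from the paper's contrastive framing rather than stated in this report.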
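The experiment-setup row translates fairly directly into a training configuration. Below is a minimal, hedged reconstruction in PyTorch/torchvision: the transform ordering follows the paper's description, while the color-jitter strength and rotation range are assumptions, and the Swin Transformer Base backbone is replaced by a placeholder module.

```python
import torch
from torch.optim import AdamW
from torchvision import transforms

# Training pipeline: resize to 600x600, random-crop to 448x448, then
# horizontal flip, color jitter, and random rotation (per the paper).
train_tf = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.RandomCrop(448),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),  # jitter strength is an assumption
    transforms.RandomRotation(15),          # rotation range is an assumption
    transforms.ToTensor(),
])

# Inference pipeline: resize, then center-crop to 448x448.
test_tf = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.CenterCrop(448),
    transforms.ToTensor(),
])

# Placeholder module; the paper uses a Swin Transformer Base backbone.
model = torch.nn.Linear(8, 8)
optimizer = AdamW(model.parameters(), lr=1e-3)  # batch size 12 in the paper
```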