TransFG: A Transformer Architecture for Fine-Grained Recognition

Authors: Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "...TransFG and demonstrate the value of it by conducting experiments on five popular fine-grained benchmarks where we achieve state-of-the-art performance."
Researcher Affiliation | Collaboration | Ju He¹, Jie-Neng Chen¹, Shuai Liu³, Adam Kortylewski², Cheng Yang³, Yutong Bai¹, Changhu Wang³ (¹Johns Hopkins University, ²Max Planck Institute for Informatics, ³ByteDance Inc.)
Pseudocode | No | The paper describes its method in detail using equations and textual descriptions, but it does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a direct link to open-source code or explicitly state that the code for their method is released.
Open Datasets | Yes | "We evaluate our proposed TransFG on five widely used fine-grained benchmarks, i.e., CUB-200-2011 (Wah et al. 2011), Stanford Cars (Krause et al. 2013), Stanford Dogs (Khosla et al. 2011), NABirds (Van Horn et al. 2015) and iNat2017 (Van Horn et al. 2018)."
Dataset Splits | Yes | "First, we resize input images to 448×448 except 304×304 on iNat2017 for fair comparison (random cropping for training and center cropping for testing)."
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies | No | The paper mentions loading weights from the "official ViT-B/16 model pretrained on ImageNet21k" and using an "SGD optimizer". However, it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | "Unless stated otherwise, we implement TransFG as follows. First, we resize input images to 448×448 except 304×304 on iNat2017 for fair comparison (random cropping for training and center cropping for testing). We split image to patches of size 16 and the step size of sliding window is set to be 12. Thus the H, W, P, S in Eq 1 are 448, 448, 16, 12 respectively. The margin α in Eq 9 is set to be 0.4. We load intermediate weights from official ViT-B/16 model pretrained on ImageNet21k. The batch size is set to 16. SGD optimizer is employed with a momentum of 0.9. The learning rate is initialized as 0.03 except 0.003 for Stanford Dogs dataset and 0.01 for iNat2017 dataset. We adopt cosine annealing as the scheduler of optimizer."
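To make the quoted setup concrete, here is a minimal PyTorch/torchvision sketch (not the authors' released implementation) of the preprocessing, the overlapping patch split, and the optimizer/scheduler. Patch size 16, stride 12, lr 0.03, momentum 0.9, batch size 16, and cosine annealing come from the quote; the intermediate resize size before cropping and the strided-convolution realization of the patch embedding are assumptions.

```python
# Sketch of the quoted TransFG training setup; assumed details are marked.
import torch
import torch.nn as nn
from torchvision import transforms

IMG_SIZE = 448    # 304 on iNat2017, per the quote
PATCH_SIZE = 16   # P in Eq 1
STRIDE = 12       # sliding-window step S in Eq 1

# Random crop for training, center crop for testing, as quoted.
# The 600x600 intermediate resize is an assumption; the quote only
# fixes the final 448x448 input size.
train_tf = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.RandomCrop((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize((600, 600)),
    transforms.CenterCrop((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])

# Overlapping patch split: with H = W = 448, P = 16, S = 12, each side
# yields (448 - 16) // 12 + 1 = 37 patches, i.e. 37**2 = 1369 patches
# per image instead of the 28**2 = 784 of a non-overlapping ViT-B/16.
n_per_side = (IMG_SIZE - PATCH_SIZE) // STRIDE + 1
assert n_per_side ** 2 == 1369

# One standard way to realize an overlapping patch embedding is a strided
# convolution (768 is the ViT-B hidden size); this choice is an assumption.
patch_embed = nn.Conv2d(3, 768, kernel_size=PATCH_SIZE, stride=STRIDE)

model = nn.Sequential(patch_embed)  # placeholder for the full TransFG model
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)  # 0.003 Dogs, 0.01 iNat2017
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)  # T_max not quoted
```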
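The margin α = 0.4 refers to the paper's contrastive loss (Eq 9), which this report does not reproduce. For orientation only, the sketch below shows a margin-based contrastive loss of the kind the paper describes; the cosine-similarity form and the pairwise averaging over the batch are assumptions, with only the margin value taken from the quote.

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(z: torch.Tensor, labels: torch.Tensor,
                            alpha: float = 0.4) -> torch.Tensor:
    """Hedged sketch of a margin-based contrastive loss (Eq 9 is not
    quoted above; only the margin alpha = 0.4 comes from the paper).

    z:      (B, D) per-image embeddings, e.g. classification tokens
    labels: (B,) integer class labels
    """
    z = F.normalize(z, dim=-1)
    sim = z @ z.t()                                    # pairwise cosine similarity
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) same-class mask
    pos = (1.0 - sim)[same].sum()                      # pull same-class pairs together
    neg = torch.clamp(sim - alpha, min=0.0)[~same].sum()  # push apart pairs closer than the margin
    return (pos + neg) / labels.numel() ** 2
```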