Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting
Authors: Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. |
| Researcher Affiliation | Academia | Key Laboratory of Image Processing and Intelligent Control, Ministry of Education; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China |
| Pseudocode | No | The paper describes the conceptual framework and processes, such as the pipeline of CACVi T and the decoupled view of self-attention, but does not provide any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be available. |
| Open Datasets | Yes | Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. |
| Dataset Splits | Yes | Experiments on the public benchmark FSC147 (Ranjan et al. 2021) show that CACViT outperforms the previous best approaches by large margins, with relative error reductions of 19.04% and 23.60% on the validation and test sets, respectively, in terms of mean absolute error. |
| Hardware Specification | Yes | Our model is trained and tested on NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions using "AdamW (Loshchilov and Hutter 2017) as the optimizer" but does not specify version numbers for any software, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | The network takes the image of size 384×384 as the input, which is first split into patches of size 16×16. Each exemplar is of size 64×64, then split into patches of size 16×16. Our feature extractor, pretrained with MAE (He et al. 2022), consists of 12 transformer encoder blocks with a hidden dimension of 768, and each multi-head self-attention layer contains 12 heads. The following 3 extra transformer blocks with a hidden dimension of 512 are adopted to enhance the feature and reduce the dimension for upsampling. Our regression decoder consists of 4 up-sampling layers with a hidden dimension of 256 as in CounTR (Liu et al. 2022). For fair comparison, we use the same data augmentation, test-time cropping and normalization as CounTR (Liu et al. 2022). We apply AdamW (Loshchilov and Hutter 2017) as the optimizer with a batch size of 8. The model is trained for 200 epochs with a learning rate of 1e-4, a weight decay rate of 0.05, and 10 epochs for warm-up. |
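
To make the Experiment Setup row above concrete, the following is a minimal PyTorch sketch of the quoted configuration. Only the hyperparameter values (image/exemplar/patch sizes, encoder depth, width, and head count, the 3 extra 512-dim blocks, the 4-layer 256-dim decoder, and the AdamW settings with 200 epochs and 10 warm-up epochs) come from the paper; all class and variable names, the head count of the extra blocks, and the post-warm-up cosine schedule are assumptions, since the authors' code was not available at the time of this report.

```python
# Hedged sketch of the CACViT training configuration described in the paper.
# Module names are illustrative placeholders, not the authors' implementation.
import torch
import torch.nn as nn

IMG_SIZE, EXEMPLAR_SIZE, PATCH = 384, 64, 16   # 384x384 image, 64x64 exemplars, 16x16 patches
ENC_DIM, ENC_DEPTH, ENC_HEADS = 768, 12, 12    # ViT-B encoder, MAE-pretrained in the paper
EXTRA_DIM, EXTRA_DEPTH = 512, 3                # 3 extra transformer blocks (dim 512)
DEC_DIM, DEC_UPSAMPLES = 256, 4                # regression decoder, 4 up-sampling layers

class UpsampleDecoder(nn.Module):
    """Sketch of a 4-stage (16x total) upsampling decoder producing a density map."""
    def __init__(self, in_dim=EXTRA_DIM, dim=DEC_DIM):
        super().__init__()
        layers, c = [], in_dim
        for _ in range(DEC_UPSAMPLES):
            layers += [nn.ConvTranspose2d(c, dim, kernel_size=2, stride=2), nn.GELU()]
            c = dim
        self.net = nn.Sequential(*layers, nn.Conv2d(dim, 1, kernel_size=1))

    def forward(self, tokens):                  # tokens: (B, 24*24, C)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.net(x)                      # (B, 1, 384, 384) density map

# Plain transformer stacks standing in for the MAE-pretrained ViT-B encoder and
# the 3 extra blocks; the real model jointly attends over image and exemplar tokens.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(ENC_DIM, ENC_HEADS, dim_feedforward=4 * ENC_DIM,
                               batch_first=True, norm_first=True),
    num_layers=ENC_DEPTH)
extra = nn.Sequential(
    nn.Linear(ENC_DIM, EXTRA_DIM),              # reduce dimension for upsampling
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(EXTRA_DIM, 8, dim_feedforward=4 * EXTRA_DIM,
                                   batch_first=True, norm_first=True),  # 8 heads: assumption
        num_layers=EXTRA_DEPTH))
decoder = UpsampleDecoder()
model = nn.ModuleDict({"encoder": encoder, "extra": extra, "decoder": decoder})

# Optimizer and schedule as quoted: AdamW, lr 1e-4, weight decay 0.05, batch size 8,
# 200 epochs with 10 warm-up epochs. Cosine decay after warm-up is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=10)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=190)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[10])
```

Under these settings, a 384×384 image yields 24×24 = 576 patch tokens and each 64×64 exemplar yields 4×4 = 16 patch tokens, which is consistent with the quoted patch size of 16×16.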