Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting
Authors: Zhicheng Wang, Liwen Xiao, Zhiguo Cao, Hao Lu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. |
| Researcher Affiliation | Academia | Key Laboratory of Image Processing and Intelligent Control, Ministry of Education; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China |
| Pseudocode | No | The paper describes the conceptual framework and processes, such as the pipeline of CACVi T and the decoupled view of self-attention, but does not provide any formal pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be available. |
| Open Datasets | Yes | Extensive experiments on the FSC147 and the CARPK datasets show that CACViT significantly outperforms state-of-the-art CAC approaches in both effectiveness (23.60% error reduction) and generalization, which suggests CACViT provides a concise and strong baseline for CAC. |
| Dataset Splits | Yes | Experiments on the public benchmark FSC147 (Ranjan et al. 2021) show that CACViT outperforms the previous best approaches by large margins, with relative error reductions of 19.04% and 23.60% on the validation and test sets, respectively, in terms of mean absolute error. |
| Hardware Specification | Yes | Our model is trained and tested on NVIDIA GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions using "AdamW (Loshchilov and Hutter 2017) as the optimizer" but does not specify version numbers for any software, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | The network takes the image of size 384×384 as the input, which is first split into patches of size 16×16. Each exemplar is of size 64×64, then split into patches of size 16×16. Our feature extractor, pretrained with MAE (He et al. 2022), consists of 12 transformer encoder blocks with a hidden dimension of 768, and each multi-head self-attention layer contains 12 heads. The following 3 extra transformer blocks with a hidden dimension of 512 are adopted to enhance the feature and reduce the dimension for upsampling. Our regression decoder consists of 4 up-sampling layers with a hidden dimension of 256 as in CounTR (Liu et al. 2022). For fair comparison, we use the same data augmentation, test-time cropping and normalization as CounTR (Liu et al. 2022). We apply AdamW (Loshchilov and Hutter 2017) as the optimizer with a batch size of 8. The model is trained for 200 epochs with a learning rate of 1e-4, a weight decay rate of 0.05, and 10 epochs for warm-up. |
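
To make the Experiment Setup row above concrete, the following is a minimal PyTorch sketch of the quoted configuration. Only the hyperparameter values (image/exemplar/patch sizes, encoder depth, width, and head count, the 3 extra 512-dim blocks, the 4-layer 256-dim decoder, and the AdamW settings with 200 epochs and 10 warm-up epochs) come from the paper; all class and variable names, the head count of the extra blocks, and the post-warm-up cosine schedule are assumptions, since the authors' code was not available at the time of this report.

```python
# Hedged sketch of the CACViT training configuration described in the paper.
# Module names are illustrative placeholders, not the authors' implementation.
import torch
import torch.nn as nn

IMG_SIZE, EXEMPLAR_SIZE, PATCH = 384, 64, 16   # 384x384 image, 64x64 exemplars, 16x16 patches
ENC_DIM, ENC_DEPTH, ENC_HEADS = 768, 12, 12    # ViT-B encoder, MAE-pretrained in the paper
EXTRA_DIM, EXTRA_DEPTH = 512, 3                # 3 extra transformer blocks (dim 512)
DEC_DIM, DEC_UPSAMPLES = 256, 4                # regression decoder, 4 up-sampling layers

class UpsampleDecoder(nn.Module):
    """Sketch of a 4-stage (16x total) upsampling decoder producing a density map."""
    def __init__(self, in_dim=EXTRA_DIM, dim=DEC_DIM):
        super().__init__()
        layers, c = [], in_dim
        for _ in range(DEC_UPSAMPLES):
            layers += [nn.ConvTranspose2d(c, dim, kernel_size=2, stride=2), nn.GELU()]
            c = dim
        self.net = nn.Sequential(*layers, nn.Conv2d(dim, 1, kernel_size=1))

    def forward(self, tokens):                  # tokens: (B, 24*24, C)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.net(x)                      # (B, 1, 384, 384) density map

# Plain transformer stacks standing in for the MAE-pretrained ViT-B encoder and
# the 3 extra blocks; the real model jointly attends over image and exemplar tokens.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(ENC_DIM, ENC_HEADS, dim_feedforward=4 * ENC_DIM,
                               batch_first=True, norm_first=True),
    num_layers=ENC_DEPTH)
extra = nn.Sequential(
    nn.Linear(ENC_DIM, EXTRA_DIM),              # reduce dimension for upsampling
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(EXTRA_DIM, 8, dim_feedforward=4 * EXTRA_DIM,
                                   batch_first=True, norm_first=True),  # 8 heads: assumption
        num_layers=EXTRA_DEPTH))
decoder = UpsampleDecoder()
model = nn.ModuleDict({"encoder": encoder, "extra": extra, "decoder": decoder})

# Optimizer and schedule as quoted: AdamW, lr 1e-4, weight decay 0.05, batch size 8,
# 200 epochs with 10 warm-up epochs. Cosine decay after warm-up is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=10)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=190)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[10])
```

Under these settings, a 384×384 image yields 24×24 = 576 patch tokens and each 64×64 exemplar yields 4×4 = 16 patch tokens, which is consistent with the quoted patch size of 16×16.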