CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Authors: Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, Rongrong Ji

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the efficacy of our CF-ViT. For example, without any compromise on performance, CF-ViT reduces 53% FLOPs of LV-ViT, and also achieves 2.01× throughput.
Researcher Affiliation | Collaboration | 1. MAC Lab, Department of Artificial Intelligence, Xiamen University; 2. Institute of Artificial Intelligence, Xiamen University; 3. Tencent Youtu Lab
Pseudocode | No | The paper includes figures illustrating the model architecture (e.g., Figure 2, Figure 3, Figure 4) but does not provide any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code of this project is at https://github.com/ChenMnZ/CF-ViT.
Open Datasets | Yes | We conduct the experiments on ImageNet (Deng et al. 2009).
Dataset Splits | Yes | We feed the model 50,000 images in the validation set of ImageNet with a batch size of 1,024, and record the total inference time. [...] We conduct a toy experiment on the validation set of ImageNet (Deng et al. 2009) with a pre-trained DeiT-S model (Touvron et al. 2021a).
Hardware Specification | Yes | Our CF-ViT model is trained on a workstation with 4 A100 GPUs. [...] The model throughput is measured as the number of processed images per second on a single A100 GPU. (A throughput-measurement sketch follows the table.)
Software Dependencies | No | The paper states 'All training settings of our CF-ViT, such as image processing, learning rate, etc., are to follow those of DeiT and LV-ViT,' but does not list any specific software libraries or frameworks with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, CUDA 11.x).
Experiment Setup | Yes | In the training phase, only conducting the fine-grained splitting at informative regions would affect the convergence. Therefore, we split the entire image into fine-grained patches in the first 200 epochs, and select informative coarse patches for fine-grained splitting in the remaining training process. [...] We feed the model 50,000 images in the validation set of ImageNet with a batch size of 1,024, and record the total inference time. [...] we set the confidence threshold η = 1... [...] For a trade-off, we set α to 0.5 in our implementation. [...] a_k = β · a_{k−1} + (1 − β) · a_k^0, where β = 0.99. (Sketches of this EMA update and the confidence-based early exit follow the table.)
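For concreteness, here is a minimal sketch of the throughput protocol quoted above (50,000 validation images, batch size 1,024, images per second on one GPU). It assumes a PyTorch classification model; the helper name `measure_throughput`, the 224×224 input size, and the warm-up count are illustrative assumptions, not details from the paper.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, num_images=50_000, batch_size=1024,
                       image_size=224, device="cuda"):
    """Images per second on a single GPU, mirroring the quoted protocol.

    The input resolution and warm-up count are assumptions; `model`
    can be any image-classification ViT (e.g., CF-ViT).
    """
    model = model.to(device).eval()
    dummy = torch.randn(batch_size, 3, image_size, image_size, device=device)

    for _ in range(5):  # warm-up so CUDA initialization does not skew timing
        model(dummy)
    torch.cuda.synchronize()

    num_batches = num_images // batch_size
    start = time.time()
    for _ in range(num_batches):
        model(dummy)
    torch.cuda.synchronize()
    return (num_batches * batch_size) / (time.time() - start)
```

Similarly, the quoted EMA of class attention and the confidence-threshold early exit can be sketched as below. All function names, tensor shapes, and the `keep_ratio` argument are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import torch

def update_global_class_attention(global_attn, layer_attn, beta=0.99):
    """EMA across layers: a_k = beta * a_{k-1} + (1 - beta) * a_k^0.

    `layer_attn` (a_k^0) is the class-token attention over patch tokens
    at layer k, assumed shape [batch, num_patches]; `global_attn` is the
    running average a_{k-1} (None before the first layer).
    """
    if global_attn is None:
        return layer_attn
    return beta * global_attn + (1.0 - beta) * layer_attn

def select_informative_patches(global_attn, keep_ratio=0.5):
    """Indices of the coarse patches with the highest global class
    attention, which are re-split into fine-grained patches.
    `keep_ratio` is a placeholder, not a value from the paper."""
    num_keep = max(1, int(global_attn.shape[-1] * keep_ratio))
    return global_attn.topk(num_keep, dim=-1).indices

def exits_at_coarse_stage(coarse_logits, eta=1.0):
    """Confidence-based early exit: True where the coarse prediction's
    softmax confidence reaches the threshold eta."""
    confidence = coarse_logits.softmax(dim=-1).max(dim=-1).values
    return confidence >= eta
```

Note that with η = 1 the confidence test is effectively never satisfied, so every image also runs the fine stage; this is consistent with the paper reporting its accuracy numbers "without any compromise on performance."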