CF-ViT: A General Coarse-to-Fine Method for Vision Transformer
Authors: Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, Rongrong Ji
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the efficacy of our CF-ViT. For example, without any compromise on performance, CF-ViT reduces 53% FLOPs of LV-ViT, and also achieves 2.01× throughput. |
| Researcher Affiliation | Collaboration | ¹MAC Lab, Department of Artificial Intelligence, Xiamen University; ²Institute of Artificial Intelligence, Xiamen University; ³Tencent Youtu Lab |
| Pseudocode | No | The paper includes figures illustrating the model architecture (e.g., Figure 2, Figure 3, Figure 4) but does not provide any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code of this project is at https://github.com/ChenMnZ/CF-ViT. |
| Open Datasets | Yes | We conduct the experiments on ImageNet (Deng et al. 2009) |
| Dataset Splits | Yes | We feed the model 50,000 images in the validation set of ImageNet with a batch size of 1,024, and record the total inference time. [...] We conduct a toy experiment on the validation set of ImageNet (Deng et al. 2009) with a pre-trained DeiT-S model (Touvron et al. 2021a). |
| Hardware Specification | Yes | Our CF-ViT model is trained on a workstation with 4 A100 GPUs. [...] The model throughput is measured as the number of processed images per second on a single A100 GPU. |
| Software Dependencies | No | The paper states 'All training settings of our CF-ViT, such as image processing, learning rate, etc., are to follow those of DeiT and LV-ViT.' but does not list any specific software libraries or frameworks with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, CUDA 11.x). |
| Experiment Setup | Yes | In the training phase, only conducting the fine-grained splitting at informative regions would affect the convergence. Therefore, we split the entire image into fine-grained patches in the first 200 epochs, and select informative coarse patches for fine-grained splitting in the remaining training process. [...] We feed the model 50,000 images in the validation set of ImageNet with a batch size of 1,024, and record the total inference time. [...] we set the confidence threshold η = 1... [...] For a trade-off, we set α to 0.5 in our implementation. [...] a_k = β · a_{k−1} + (1 − β) · a_k^0, where β = 0.99. (See the sketch after this table.) |
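
For context, the Experiment Setup row describes a two-stage pipeline: a cheap coarse pass, an early exit when prediction confidence exceeds the threshold η, and an exponential moving average of class attention (β = 0.99) used to select informative coarse patches for fine-grained re-splitting. The PyTorch sketch below is illustrative only, not the authors' implementation: `forward_coarse`, `forward_fine`, the η value of 0.7, and `top_ratio` are hypothetical stand-ins (the quoted text sets η = 1, and the released code at the repository above is the authoritative reference).

```python
import torch
import torch.nn.functional as F

BETA = 0.99  # EMA factor for global class attention (beta = 0.99 per the paper)
ETA = 0.7    # hypothetical confidence threshold (the quote above uses eta = 1)

def global_class_attention(per_layer_cls_attn):
    """Accumulate class attention across layers with an EMA:
    a_k = BETA * a_{k-1} + (1 - BETA) * a_k^0."""
    a = per_layer_cls_attn[0]
    for a0_k in per_layer_cls_attn[1:]:
        a = BETA * a + (1.0 - BETA) * a0_k
    return a  # shape: (batch, num_coarse_patches)

@torch.no_grad()
def coarse_to_fine_inference(model, images, top_ratio=0.5):
    # Stage 1: coarse split -- a cheap forward pass over few, large patches.
    # `forward_coarse` is a hypothetical hook returning logits plus the
    # per-layer class-attention maps over coarse patches.
    logits_c, cls_attn_per_layer = model.forward_coarse(images)
    conf, pred = F.softmax(logits_c, dim=-1).max(dim=-1)

    # Early exit (simplified to whole-batch here; the paper exits per sample).
    if conf.min() >= ETA:
        return pred

    # Stage 2: rank coarse patches by global class attention and re-split
    # only the most informative ones into fine-grained patches.
    attn = global_class_attention(cls_attn_per_layer)
    k = int(top_ratio * attn.shape[1])
    informative = attn.topk(k, dim=1).indices
    logits_f = model.forward_fine(images, informative)
    return logits_f.argmax(dim=-1)
```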