CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Authors: Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, Rongrong Ji

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the efficacy of our CF-ViT. For example, without any compromise on performance, CF-ViT reduces 53% FLOPs of LV-ViT, and also achieves 2.01× throughput.
Researcher Affiliation | Collaboration | 1. MAC Lab, Department of Artificial Intelligence, Xiamen University; 2. Institute of Artificial Intelligence, Xiamen University; 3. Tencent Youtu Lab
Pseudocode | No | The paper includes figures illustrating the model architecture (e.g., Figure 2, Figure 3, Figure 4) but does not provide any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code of this project is at https://github.com/ChenMnZ/CF-ViT.
Open Datasets | Yes | We conduct the experiments on ImageNet (Deng et al. 2009).
Dataset Splits | Yes | We feed the model 50,000 images in the validation set of ImageNet with a batch size of 1,024, and record the total inference time. [...] We conduct a toy experiment on the validation set of ImageNet (Deng et al. 2009) with a pre-trained DeiT-S model (Touvron et al. 2021a).
Hardware Specification | Yes | Our CF-ViT model is trained on a workstation with 4 A100 GPUs. [...] The model throughput is measured as the number of processed images per second on a single A100 GPU. (A throughput-measurement sketch follows the table.)
Software Dependencies | No | The paper states 'All training settings of our CF-ViT, such as image processing, learning rate, etc., are to follow those of DeiT and LV-ViT,' but does not list any specific software libraries or frameworks with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, CUDA 11.x).
Experiment Setup | Yes | In the training phase, only conducting the fine-grained splitting at informative regions would affect the convergence. Therefore, we split the entire image into fine-grained patches in the first 200 epochs, and select informative coarse patches for fine-grained splitting in the remaining training process. [...] We feed the model 50,000 images in the validation set of ImageNet with a batch size of 1,024, and record the total inference time. [...] we set the confidence threshold η = 1... [...] For a trade-off, we set α to 0.5 in our implementation. [...] a_k = β · a_{k−1} + (1 − β) · a_k^0, where β = 0.99. (Sketches of this EMA update and the confidence-based early exit follow the table.)
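For concreteness, here is a minimal sketch of the throughput protocol quoted above (50,000 validation images, batch size 1,024, images per second on one GPU). It assumes a PyTorch classification model; the helper name `measure_throughput`, the 224×224 input size, and the warm-up count are illustrative assumptions, not details from the paper.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, num_images=50_000, batch_size=1024,
                       image_size=224, device="cuda"):
    """Images per second on a single GPU, mirroring the quoted protocol.

    The input resolution and warm-up count are assumptions; `model`
    can be any image-classification ViT (e.g., CF-ViT).
    """
    model = model.to(device).eval()
    dummy = torch.randn(batch_size, 3, image_size, image_size, device=device)

    for _ in range(5):  # warm-up so CUDA initialization does not skew timing
        model(dummy)
    torch.cuda.synchronize()

    num_batches = num_images // batch_size
    start = time.time()
    for _ in range(num_batches):
        model(dummy)
    torch.cuda.synchronize()
    return (num_batches * batch_size) / (time.time() - start)
```

Similarly, the quoted EMA of class attention and the confidence-threshold early exit can be sketched as below. All function names, tensor shapes, and the `keep_ratio` argument are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import torch

def update_global_class_attention(global_attn, layer_attn, beta=0.99):
    """EMA across layers: a_k = beta * a_{k-1} + (1 - beta) * a_k^0.

    `layer_attn` (a_k^0) is the class-token attention over patch tokens
    at layer k, assumed shape [batch, num_patches]; `global_attn` is the
    running average a_{k-1} (None before the first layer).
    """
    if global_attn is None:
        return layer_attn
    return beta * global_attn + (1.0 - beta) * layer_attn

def select_informative_patches(global_attn, keep_ratio=0.5):
    """Indices of the coarse patches with the highest global class
    attention, which are re-split into fine-grained patches.
    `keep_ratio` is a placeholder, not a value from the paper."""
    num_keep = max(1, int(global_attn.shape[-1] * keep_ratio))
    return global_attn.topk(num_keep, dim=-1).indices

def exits_at_coarse_stage(coarse_logits, eta=1.0):
    """Confidence-based early exit: True where the coarse prediction's
    softmax confidence reaches the threshold eta."""
    confidence = coarse_logits.softmax(dim=-1).max(dim=-1).values
    return confidence >= eta
```

Note that with η = 1 the confidence test is effectively never satisfied, so every image also runs the fine stage; this is consistent with the paper reporting its accuracy numbers "without any compromise on performance."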