Transformer Doctor: Diagnosing and Treating Vision Transformers

Authors: Jiacong Hu, Hao Chen, Kejia Chen, Yang Gao, Jingwen Ye, Xingen Wang, Mingli Song, Zunlei Feng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a plethora of quantitative and qualitative experiments, it has been demonstrated that Transformer Doctor can effectively address internal errors in transformers, thereby enhancing model performance.
Researcher Affiliation | Collaboration | Jiacong Hu (1,4), Hao Chen (1), Kejia Chen (2), Yang Gao (6), Jingwen Ye (3), Xingen Wang (1,6), Mingli Song (1,4,5), Zunlei Feng (2,4,5); 1 College of Computer Science and Technology, Zhejiang University; 2 School of Software Technology, Zhejiang University; 3 Electrical and Computer Engineering, National University of Singapore; 4 State Key Laboratory of Blockchain and Data Security, Zhejiang University; 5 Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; 6 Bangsheng Technology Co., Ltd.
Pseudocode | No | The paper describes methods and formulas but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | For more information, please visit https://transformer-doctor.github.io/. Additionally, the algorithm code for the Transformer Doctor is included in the uploaded source_codes.zip file.
Open Datasets | Yes | To validate the effectiveness of Transformer Doctor, we conducted experiments on five mainstream datasets: CIFAR-10 [67], CIFAR-100 [67], ImageNet-10 [68], ImageNet-50 [69], and ImageNet-1k [68].
Dataset Splits | No | The paper mentions training and testing but does not provide the specific train/validation/test splits (percentages or counts) needed for reproduction.
Hardware Specification | Yes | In the experiments, we utilized two Linux servers, each equipped with 8 NVIDIA A6000 GPU cards, 24 CPU cores, and 500GB of memory.
Software Dependencies | No | The paper mentions using the AdamW [70] optimizer but does not provide version numbers for key software components or libraries.
Experiment Setup | Yes | During all training stages, each dataset was trained for 300 epochs using the AdamW [70] optimizer, with an initial learning rate of 0.01. The learning rate decayed according to a cosine annealing schedule, with T_max set to 300 epochs. Additionally, α and β were set to default values of 10 and 100, respectively, to balance the loss functions. The default value of τ was 0.15, and the constrained loss function was applied by default to the last block.
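
The hyperparameters quoted in the Experiment Setup row map onto standard PyTorch components. The sketch below is a minimal, hedged reconstruction of that configuration: the tiny model, the random-data loader, and the constrained_losses helper are hypothetical stand-ins (the paper's actual Transformer Doctor losses and ViT backbone are not reproduced here), while the optimizer, learning rate, cosine schedule, epoch count, and the weights α = 10 and β = 100 follow the quoted description.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hyperparameters quoted from the Experiment Setup row.
EPOCHS = 300               # each dataset trained for 300 epochs
INIT_LR = 0.01             # initial learning rate
ALPHA, BETA = 10.0, 100.0  # default weights balancing the loss terms
TAU = 0.15                 # default threshold tau (unused in this stub)

# Hypothetical stand-ins: a tiny classifier and random batches replace the
# ViT backbone and the CIFAR/ImageNet loaders used in the paper.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
train_loader = [(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
                for _ in range(4)]

optimizer = AdamW(model.parameters(), lr=INIT_LR)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine decay, T_max = 300
ce = nn.CrossEntropyLoss()

def constrained_losses(m):
    """Hypothetical placeholder for the paper's constrained loss terms,
    which are applied to the last block in the actual method."""
    return torch.tensor(0.0), torch.tensor(0.0)

for epoch in range(EPOCHS):
    for images, labels in train_loader:
        logits = model(images)
        l_a, l_b = constrained_losses(model)
        # Total objective: task loss + alpha * L_a + beta * L_b.
        loss = ce(logits, labels) + ALPHA * l_a + BETA * l_b
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # one cosine-annealing step per epoch
```

This only illustrates how the reported optimizer, schedule, and loss weights would be wired together; reproducing the paper's results additionally requires its released code and loss definitions (see the Open Source Code row above).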