Transformer Doctor: Diagnosing and Treating Vision Transformers
Authors: Jiacong Hu, Hao Chen, Kejia Chen, Yang Gao, Jingwen Ye, Xingen Wang, Mingli Song, Zunlei Feng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a plethora of quantitative and qualitative experiments, it has been demonstrated that Transformer Doctor can effectively address internal errors in transformers, thereby enhancing model performance. |
| Researcher Affiliation | Collaboration | Jiacong Hu1,4, Hao Chen1, Kejia Chen2, Yang Gao6, Jingwen Ye3, Xingen Wang1,6, Mingli Song1,4,5, Zunlei Feng2,4,5 1College of Computer Science and Technology, Zhejiang University, 2School of Software Technology, Zhejiang University, 3Electrical and Computer Engineering, National University of Singapore, 4State Key Laboratory of Blockchain and Data Security, Zhejiang University, 5Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, 6Bangsheng Technology Co., Ltd. |
| Pseudocode | No | The paper describes methods and formulas but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | For more information, please visit https://transformer-doctor.github.io/. Additionally, the algorithm code for the Transformer Doctor is included in the uploaded source_codes.zip file. |
| Open Datasets | Yes | To validate the effectiveness of Transformer Doctor, we conducted experiments on five mainstream datasets: CIFAR-10 [67], CIFAR-100 [67], ImageNet-10 [68], ImageNet-50 [69], and ImageNet-1k [68]. |
| Dataset Splits | No | The paper mentions training and testing but does not provide specific train/validation/test dataset splits (percentages or counts) needed for reproduction. |
| Hardware Specification | Yes | In the experiments, we utilized two Linux servers, each equipped with 8 NVIDIA A6000 GPU cards, 24 CPU cores, and 500GB of memory. |
| Software Dependencies | No | The paper mentions using the AdamW [70] optimizer but does not provide version numbers for key software components or libraries. |
| Experiment Setup | Yes | In all training stages, each dataset was trained for 300 epochs using the AdamW [70] optimizer, with an initial learning rate of 0.01. The learning rate decayed according to a cosine annealing schedule, with T_max set to 300 epochs. Additionally, α and β were set to default values of 10 and 100, respectively, to balance each loss function. The default value of τ was 0.15, and the constrained loss function was applied by default to the last block. |
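
The hyperparameters reported in the Experiment Setup row correspond to a standard optimizer/scheduler configuration. The sketch below shows one possible reading, assuming a PyTorch implementation; the stand-in model, data, and the `constraint_a`/`constraint_b` placeholders for the paper's constrained loss terms are illustrative assumptions, not the authors' released code.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in model and data; the paper trains vision transformers on image datasets.
model = nn.Linear(16, 10)
inputs = torch.randn(8, 16)
targets = torch.randint(0, 10, (8,))

EPOCHS = 300                 # 300 epochs per dataset, as reported
ALPHA, BETA = 10.0, 100.0    # default loss-balancing weights from the paper
TAU = 0.15                   # default tau reported in the paper

criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=0.01)          # initial learning rate 0.01
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine annealing, T_max = 300

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    logits = model(inputs)
    # Hypothetical placeholders for the paper's constrained loss terms, which are
    # applied to the last block; set to zero here so the script stays self-contained.
    constraint_a = torch.tensor(0.0)
    constraint_b = torch.tensor(0.0)
    loss = criterion(logits, targets) + ALPHA * constraint_a + BETA * constraint_b
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The cosine schedule with T_max equal to the total number of epochs means the learning rate decays from 0.01 toward its minimum exactly once over the 300-epoch run.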