Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation
Authors: Wei Dong, Yuan Sun, Yiting Yang, Xing Zhang, Zhijun Lin, Qingsen Yan, Haokui Zhang, Peng Wang, Yang Yang, Hengtao Shen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance. We conducted experiments on a set of downstream vision classification tasks. The results show that our method can be effectively applied to various ViT versions, achieving promising fine-tuning performance. |
| Researcher Affiliation | Academia | Wei Dong¹, Yuan Sun¹, Yiting Yang¹, Xing Zhang¹, Zhijun Lin², Qingsen Yan², Haokui Zhang², Peng Wang³, Yang Yang³, Hengtao Shen³,⁴. ¹College of Information and Control Engineering, Xi'an University of Architecture and Technology; ²School of Computer Science, Northwestern Polytechnical University; ³School of Computer Science and Engineering, University of Electronic Science and Technology of China; ⁴School of Computer Science and Technology, Tongji University |
| Pseudocode | No | The paper describes the proposed method using mathematical equations and textual explanations, but it does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Due to the limitation that supplementary materials larger than 100 MB cannot be uploaded to the OpenReview website, only the project code is uploaded to this website as the concise supplementary material. Please refer to the anonymous link https://drive.google.com/file/d/18sXhtqMlKZd4_LRICk2NvSlKiFiHrG2d/view? to obtain the complete code, datasets, and models. |
| Open Datasets | Yes | We evaluated the effectiveness of our method using two sets of visual adaptation benchmarks: FGVC and VTAB-1k, involving a total of 24 datasets. The FGVC collection consists of five Fine-Grained Visual Classification (FGVC) datasets: CUB-200-2011, NABirds, Oxford Flowers, Stanford Dogs, and Stanford Cars. The VTAB-1k benchmark comprises 19 diverse visual classification tasks, divided into three categories: Natural, which includes images captured by standard cameras; Specialized, which includes images captured by specialized equipment such as remote sensing and medical imaging devices; and Structured, which includes synthesized images from simulated environments, such as object counting and 3D depth prediction. Each VTAB-1k task includes 1,000 training samples. Detailed dataset statistics: We provide detailed information about the datasets used in this paper, including the number of classes and the sizes of the training, validation, and test sets, in Table 1 (FGVC) and Table 2 (VTAB-1k). |
| Dataset Splits | Yes | We provide detailed information about the datasets used in this paper, including the number of classes and the sizes of the training, validation, and test sets, in Table 1 (FGVC) and Table 2 (VTAB-1k). |
| Hardware Specification | Yes | All experiments are conducted using the PyTorch framework [36] on an NVIDIA A800 GPU with 80 GB of memory. |
| Software Dependencies | No | The paper mentions using "the PyTorch framework [36]" and "the AdamW [35] optimizer" but does not specify their version numbers, which are required for a reproducible description. |
| Experiment Setup | Yes | We used the AdamW [35] optimizer to fine-tune the models for 100 epochs. The learning rate schedule was managed using the cosine decay strategy. Table 3 provides a summary of the configurations used in our experiments. As discussed in Section 4, we performed a grid search on the validation set of each task to determine the optimal hyperparameters, including learning rate, weight decay, batch size, and dropout rate. Optimizer: AdamW; Learning Rate: {0.2, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0001}; Weight Decay: {0.05, 0.01, 0.005, 0.001, 0}; Batch Size: {64, 32, 16}; Adapter Dropout: {0.5, 0.3, 0.2, 0.1, 0}; Learning Rate Schedule: Cosine Decay; Training Epochs: 100; Warmup Epochs: 10 |
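The paper presents its method through equations rather than pseudocode. For orientation only, the Householder transformation named in the title can be sketched from its standard definition, H = I - 2vvᵀ/(vᵀv): an orthogonal reflection parameterized by a single vector v. This is an illustrative NumPy construction, not the authors' implementation:

```python
import numpy as np

def householder_matrix(v):
    """Standard Householder reflection H = I - 2 v v^T / (v^T v).

    H is orthogonal (H @ H.T == I) and maps v to -v, so a full matrix
    transform is parameterized by just one d-dimensional vector.
    """
    v = np.asarray(v, dtype=float).reshape(-1, 1)  # column vector
    d = v.shape[0]
    return np.eye(d) - 2.0 * (v @ v.T) / float(v.T @ v)

# Example: reflect in 3-D
v = np.array([1.0, 2.0, 3.0])
H = householder_matrix(v)
```

The parameter-efficiency angle is visible here: the reflection needs d parameters rather than the d² of a dense matrix.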
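The hyperparameter grid quoted from the paper's Table 3 can be enumerated directly; the values below are verbatim from the quoted cell, while the enumeration code itself is an illustrative sketch, not the authors' tuning script:

```python
from itertools import product

# Search space quoted from Table 3 of the paper (values verbatim).
grid = {
    "learning_rate": [0.2, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0001],
    "weight_decay": [0.05, 0.01, 0.005, 0.001, 0],
    "batch_size": [64, 32, 16],
    "adapter_dropout": [0.5, 0.3, 0.2, 0.1, 0],
}

# Cartesian product of all grid axes: one dict per candidate configuration.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# 7 * 5 * 3 * 5 = 525 candidate configurations per task
print(len(configs))
```

Each of the 525 configurations would be scored on a task's validation set, with the best one retrained under the fixed schedule (AdamW, cosine decay, 100 epochs, 10 warmup epochs).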