Slicing Vision Transformer for Flexible Inference

Authors: Yitian Zhang, Xu Ma, Huan Wang, Ke Ma, Stephen Chen, Derek Hu, Yun Fu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical validations on different tasks demonstrate that with only one-shot training, Scala learns slimmable representation without modifying the original ViT structure and matches the performance of Separate Training.
Researcher Affiliation | Collaboration | Snap Inc.; Northeastern University; Meta
Pseudocode | No | The paper describes its methods using text and mathematical equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper states 'Code is available at here,' and the NeurIPS Paper Checklist states 'We have included the details and will release the code soon.' The promise to release the code soon indicates that the code was not concretely available at the time of publication.
Open Datasets | Yes | All the object recognition experiments are carried out on ImageNet-1K [8].
Dataset Splits | Yes | All the object recognition experiments are carried out on ImageNet-1K [8]. We follow the training recipe of DeiT [29] and conduct the experiments on 4 V100 GPUs.
Hardware Specification | Yes | We follow the training recipe of DeiT [29] and conduct the experiments on 4 V100 GPUs. The models are trained on 4 V100 and 8 A100 GPUs with a total batch size of 1024.
Software Dependencies | No | The paper mentions software components like 'AdamW' and data augmentation techniques ('RandAugment', 'Mixup', 'CutMix', 'Random Erasing'), but does not specify their version numbers or the versions of underlying programming languages or libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | For Scala, we set s = 0.25, l = 1.0, and ε = 0.0625 so that we could enable a single ViT to represent 13 different networks (X = 13) with a large slicing bound (i.e., F_l(·) is almost 16 times larger than F_s(·)). We use random horizontal flipping, random erasing [44], Mixup [42], CutMix [40], and RandAugment [6] for data augmentation. AdamW [23] is utilized as the optimizer with a momentum of 0.9 and a weight decay of 0.05. We set the learning rate to 1e-3 and decay it with a cosine schedule.
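
The quoted hyperparameters determine both the number of sub-networks and the slicing bound: widths run from s = 0.25 to l = 1.0 in steps of ε = 0.0625, giving (1.0 - 0.25)/0.0625 + 1 = 13 ratios, and since ViT FLOPs scale roughly quadratically with the width ratio, the largest sub-network F_l(·) costs about (1.0/0.25)^2 = 16 times the smallest F_s(·). The sketch below checks this arithmetic and mirrors the quoted AdamW/cosine recipe; it is not the authors' released code, and the stand-in model, 300-epoch horizon, and beta2 = 0.999 are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

# Width ratios swept by Scala: s = 0.25 up to l = 1.0 in steps of eps = 0.0625.
s, l, eps = 0.25, 1.0, 0.0625
ratios = np.arange(s, l + 1e-9, eps)
assert len(ratios) == 13                      # X = 13 sub-networks

# ViT FLOPs grow roughly quadratically with the width ratio, so the largest
# sub-network F_l(.) costs about (l / s)^2 = 16x the smallest F_s(.).
flops_gap = (l / s) ** 2                      # 16.0

# Optimizer and schedule matching the quoted recipe (hypothetical stand-in model).
model = nn.Linear(384, 1000)                  # placeholder for the ViT backbone
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                                  # base learning rate
    betas=(0.9, 0.999),                       # "momentum of 0.9" maps to beta1; beta2 assumed
    weight_decay=0.05,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # epoch count assumed
```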