Dynamic Grained Encoder for Vision Transformers

Authors: Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng

NeurIPS 2021

Reproducibility assessment: each variable below lists the automated result and the supporting LLM response.
Research Type: Experimental
LLM Response: Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach.
Researcher Affiliation: Collaboration
LLM Response: Lin Song (1), Songyang Zhang (2,4,5), Songtao Liu (3), Zeming Li (3), Xuming He (2), Hongbin Sun (1), Jian Sun (3), Nanning Zheng (1). (1) College of Artificial Intelligence, Xi'an Jiaotong University; (2) ShanghaiTech University; (3) Megvii Inc. (Face++); (4) University of Chinese Academy of Sciences; (5) Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences.
Pseudocode: No
LLM Response: The paper does not contain any structured pseudocode or algorithm blocks.
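Since the paper ships no algorithm blocks, the following toy Python sketch illustrates the general gating idea behind a dynamic grained encoder: a lightweight gate picks one query granularity per spatial region from the default Φ = {1, 2, 4}, and queries are pooled accordingly. The class name, the linear gate, the Gumbel-softmax relaxation, and all shapes below are illustrative assumptions rather than the authors' implementation; the released vtpack code is the authoritative reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDynamicGrainedQueries(nn.Module):
    """Toy sketch: gate each r x r region to one granularity in Phi and
    replace its queries with correspondingly pooled ones (NOT the authors' code)."""

    def __init__(self, dim, region=4, granularities=(1, 2, 4)):
        super().__init__()
        self.region = region
        self.granularities = granularities
        self.gate = nn.Linear(dim, len(granularities))  # one logit per candidate granularity

    def forward(self, x):
        # x: (B, C, H, W) feature map; H and W are assumed divisible by `region`.
        B, C, H, W = x.shape
        r = self.region
        desc = F.adaptive_avg_pool2d(x, (H // r, W // r))         # one descriptor per region
        logits = self.gate(desc.permute(0, 2, 3, 1))              # (B, H/r, W/r, |Phi|)
        choice = F.gumbel_softmax(logits, hard=True, dim=-1)      # differentiable one-hot pick
        out = torch.zeros_like(x)
        for i, g in enumerate(self.granularities):
            q = F.adaptive_avg_pool2d(x, (H // r * g, W // r * g))  # g x g queries per region
            q = F.interpolate(q, size=(H, W), mode="nearest")       # broadcast back to the grid
            w = choice[..., i].repeat_interleave(r, dim=1).repeat_interleave(r, dim=2)
            out = out + w.unsqueeze(1) * q                          # keep only the gated branch
        return out

# Example: a 256x256 image with patch size 16 gives a 16x16 grid of DeiT-S tokens.
# queries = ToyDynamicGrainedQueries(dim=384)(torch.randn(2, 384, 16, 16))
```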
Open Source Code: Yes
LLM Response: Code is available at https://github.com/StevenGrove/vtpack.
Open Datasets: Yes
LLM Response: ImageNet [20] classification dataset, COCO dataset [58], ADE20K [62] dataset.
Dataset Splits: Yes
LLM Response: To investigate the spatial redundancy of vision transformers on image data, we conduct a series of experiments on the ImageNet [20] val set with a pre-trained DeiT-S [19] model. All the experiments for image classification are based on the ImageNet [20] classification dataset. We apply our models for object detection and instance segmentation on the COCO dataset [58]. We further evaluate our models as the backbones for Semantic-FPN [61] on the ADE20K [62] dataset. These datasets typically have well-defined public splits.
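For reference, a minimal data-loading sketch assuming the standard torchvision ImageNet wrapper and the usual public splits; the root path and the evaluation transform (the resize size in particular) are placeholders, not taken from the paper.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageNet

# Assumed evaluation preprocessing for 256x256 inputs; the resize size is a guess.
eval_tf = T.Compose([
    T.Resize(292),
    T.CenterCrop(256),
    T.ToTensor(),
])

# Standard public splits; the spatial-redundancy study quoted above uses `val`.
train_set = ImageNet(root="/data/imagenet", split="train")
val_set = ImageNet(root="/data/imagenet", split="val", transform=eval_tf)
```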
Hardware Specification: Yes
LLM Response: During the training phase, we use four compute nodes with 32 Nvidia Tesla V100 GPUs. For the runtime evaluation, we measure the frameworks on both an Intel Xeon Gold 6130 CPU and an Nvidia Tesla V100 GPU to demonstrate the efficiency of our dynamic networks.
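The benchmarking protocol itself is not quoted; a minimal PyTorch latency-measurement sketch consistent with the listed hardware could look as follows. The warmup and iteration counts and the batch size of 1 are assumptions.

```python
import time
import torch

@torch.no_grad()
def measure_latency(net, device, iters=100, warmup=10):
    """Mean per-forward latency in seconds for a single 3x256x256 input."""
    net = net.eval().to(device)
    x = torch.randn(1, 3, 256, 256, device=device)
    for _ in range(warmup):          # warm up kernels and caches before timing
        net(x)
    if device.type == "cuda":
        torch.cuda.synchronize()     # CUDA launches are async; sync before and after
    start = time.perf_counter()
    for _ in range(iters):
        net(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# cpu_ms = 1e3 * measure_latency(model, torch.device("cpu"))
# gpu_ms = 1e3 * measure_latency(model, torch.device("cuda"))
```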
Software Dependencies: No
LLM Response: The paper mentions using 'Detectron2' and the 'MMSegmentation toolkit' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup: Yes
LLM Response: All the experiments for image classification are based on the ImageNet [20] classification dataset. We use 256×256 as the input image resolution for training and evaluation. For a fair comparison, we follow the training settings in DeiT and PVT. Specifically, random-size cropping, random horizontal flipping [53], and mixup [54] are used for data augmentation. We use the AdamW [55] optimizer with a weight decay of 0.05 and a momentum of 0.9. The learning rate is initially set to 0.001 and decreases according to a cosine schedule [56]. All the models are trained for 300 epochs with 128 images per batch. Label-smoothing regularization is used in the training phase. Besides, for the dynamic grained encoders, λ is set to 1.0 and Φ is set to {1, 2, 4} by default.
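A PyTorch sketch of this recipe using only the quoted values (AdamW, weight decay 0.05, initial lr 0.001 with cosine decay over 300 epochs, 128 images per batch, 256×256 crops with random-size cropping and horizontal flipping, label smoothing); the smoothing value, beta2, and the stand-in model are assumptions, and mixup plus the encoder-specific λ and Φ hyperparameters are left to the training loop.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Quoted augmentations: random-size cropping and random horizontal flipping
# at the 256x256 training resolution (mixup would be applied inside the loop).
train_tf = T.Compose([
    T.RandomResizedCrop(256),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 1000))  # stand-in for DeiT/PVT + DGE

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing value assumed, not quoted

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                 # initial learning rate from the paper
    betas=(0.9, 0.999),      # "momentum of 0.9" maps to beta1; beta2 assumed
    weight_decay=0.05,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # cosine decay over 300 epochs

# 128 images per batch, e.g.:
# loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```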