Dynamic Grained Encoder for Vision Transformers

Authors: Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng

NeurIPS 2021

Reproducibility assessment: each variable below lists the automated result and the supporting LLM response.
Research Type: Experimental
LLM Response: Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach.
Researcher Affiliation: Collaboration
LLM Response: Lin Song (1), Songyang Zhang (2,4,5), Songtao Liu (3), Zeming Li (3), Xuming He (2), Hongbin Sun (1), Jian Sun (3), Nanning Zheng (1). (1) College of Artificial Intelligence, Xi'an Jiaotong University; (2) ShanghaiTech University; (3) Megvii Inc. (Face++); (4) University of Chinese Academy of Sciences; (5) Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences.
Pseudocode: No
LLM Response: The paper does not contain any structured pseudocode or algorithm blocks.
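Since the paper ships no algorithm blocks, the following toy Python sketch illustrates the general gating idea behind a dynamic grained encoder: a lightweight gate picks one query granularity per spatial region from the default Φ = {1, 2, 4}, and queries are pooled accordingly. The class name, the linear gate, the Gumbel-softmax relaxation, and all shapes below are illustrative assumptions rather than the authors' implementation; the released vtpack code is the authoritative reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDynamicGrainedQueries(nn.Module):
    """Toy sketch: gate each r x r region to one granularity in Phi and
    replace its queries with correspondingly pooled ones (NOT the authors' code)."""

    def __init__(self, dim, region=4, granularities=(1, 2, 4)):
        super().__init__()
        self.region = region
        self.granularities = granularities
        self.gate = nn.Linear(dim, len(granularities))  # one logit per candidate granularity

    def forward(self, x):
        # x: (B, C, H, W) feature map; H and W are assumed divisible by `region`.
        B, C, H, W = x.shape
        r = self.region
        desc = F.adaptive_avg_pool2d(x, (H // r, W // r))         # one descriptor per region
        logits = self.gate(desc.permute(0, 2, 3, 1))              # (B, H/r, W/r, |Phi|)
        choice = F.gumbel_softmax(logits, hard=True, dim=-1)      # differentiable one-hot pick
        out = torch.zeros_like(x)
        for i, g in enumerate(self.granularities):
            q = F.adaptive_avg_pool2d(x, (H // r * g, W // r * g))  # g x g queries per region
            q = F.interpolate(q, size=(H, W), mode="nearest")       # broadcast back to the grid
            w = choice[..., i].repeat_interleave(r, dim=1).repeat_interleave(r, dim=2)
            out = out + w.unsqueeze(1) * q                          # keep only the gated branch
        return out

# Example: a 256x256 image with patch size 16 gives a 16x16 grid of DeiT-S tokens.
# queries = ToyDynamicGrainedQueries(dim=384)(torch.randn(2, 384, 16, 16))
```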
Open Source Code: Yes
LLM Response: Code is available at https://github.com/StevenGrove/vtpack.
Open Datasets: Yes
LLM Response: ImageNet [20] classification dataset, COCO dataset [58], ADE20K [62] dataset.
Dataset Splits: Yes
LLM Response: To investigate the spatial redundancy of vision transformers on image data, we conduct a series of experiments on the ImageNet [20] val set with a pre-trained DeiT-S [19] model. All the experiments for image classification are based on the ImageNet [20] classification dataset. We apply our models for object detection and instance segmentation on the COCO dataset [58]. We further evaluate our models as the backbones for Semantic-FPN [61] on the ADE20K [62] dataset. These datasets typically have well-defined public splits.
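For reference, a minimal data-loading sketch assuming the standard torchvision ImageNet wrapper and the usual public splits; the root path and the evaluation transform (the resize size in particular) are placeholders, not taken from the paper.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageNet

# Assumed evaluation preprocessing for 256x256 inputs; the resize size is a guess.
eval_tf = T.Compose([
    T.Resize(292),
    T.CenterCrop(256),
    T.ToTensor(),
])

# Standard public splits; the spatial-redundancy study quoted above uses `val`.
train_set = ImageNet(root="/data/imagenet", split="train")
val_set = ImageNet(root="/data/imagenet", split="val", transform=eval_tf)
```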
Hardware Specification: Yes
LLM Response: During the training phase, we use four compute nodes with 32 Nvidia Tesla V100 GPUs. For the runtime evaluation, we measure the frameworks on both an Intel Xeon Gold 6130 CPU and an Nvidia Tesla V100 GPU to demonstrate the efficiency of our dynamic networks.
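The benchmarking protocol itself is not quoted; a minimal PyTorch latency-measurement sketch consistent with the listed hardware could look as follows. The warmup and iteration counts and the batch size of 1 are assumptions.

```python
import time
import torch

@torch.no_grad()
def measure_latency(net, device, iters=100, warmup=10):
    """Mean per-forward latency in seconds for a single 3x256x256 input."""
    net = net.eval().to(device)
    x = torch.randn(1, 3, 256, 256, device=device)
    for _ in range(warmup):          # warm up kernels and caches before timing
        net(x)
    if device.type == "cuda":
        torch.cuda.synchronize()     # CUDA launches are async; sync before and after
    start = time.perf_counter()
    for _ in range(iters):
        net(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# cpu_ms = 1e3 * measure_latency(model, torch.device("cpu"))
# gpu_ms = 1e3 * measure_latency(model, torch.device("cuda"))
```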
Software Dependencies: No
LLM Response: The paper mentions using 'Detectron2' and the 'MMSegmentation toolkit' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup: Yes
LLM Response: All the experiments for image classification are based on the ImageNet [20] classification dataset. We use 256×256 as the input image resolution for training and evaluation. For a fair comparison, we follow the training settings in DeiT and PVT. Specifically, random-size cropping, random horizontal flipping [53], and mixup [54] are used for data augmentation. We use the AdamW [55] optimizer with a weight decay of 0.05 and a momentum of 0.9. The learning rate is initially set to 0.001 and decreases according to a cosine schedule [56]. All the models are trained for 300 epochs with 128 images per batch. Label-smoothing regularization is used in the training phase. Besides, for the dynamic grained encoders, λ is set to 1.0 and Φ is set to {1, 2, 4} by default.
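A PyTorch sketch of this recipe using only the quoted values (AdamW, weight decay 0.05, initial lr 0.001 with cosine decay over 300 epochs, 128 images per batch, 256×256 crops with random-size cropping and horizontal flipping, label smoothing); the smoothing value, beta2, and the stand-in model are assumptions, and mixup plus the encoder-specific λ and Φ hyperparameters are left to the training loop.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Quoted augmentations: random-size cropping and random horizontal flipping
# at the 256x256 training resolution (mixup would be applied inside the loop).
train_tf = T.Compose([
    T.RandomResizedCrop(256),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 1000))  # stand-in for DeiT/PVT + DGE

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # smoothing value assumed, not quoted

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                 # initial learning rate from the paper
    betas=(0.9, 0.999),      # "momentum of 0.9" maps to beta1; beta2 assumed
    weight_decay=0.05,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # cosine decay over 300 epochs

# 128 images per batch, e.g.:
# loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```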