Blending Anti-Aliasing into Vision Transformer

Authors: Shengju Qian, Hao Shao, Yi Zhu, Mu Li, Jiaya Jia

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive investigations are conducted, primarily on filtering choices such as the basic Gaussian blurring filters in [10] and learnable convolutional filters, as well as on different placements of these filters. Specifically, we propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue. We investigate the effectiveness and generalization of the proposed method across multiple tasks and various vision transformer families. This lightweight design consistently attains a clear boost over several well-known structures. (A hedged sketch of such a filtering module follows the table.)
Researcher Affiliation | Collaboration | Shengju Qian (The Chinese University of Hong Kong), Hao Shao (Tsinghua University), Yi Zhu (Amazon Inc.), Mu Li (Amazon Inc.), Jiaya Jia (The Chinese University of Hong Kong)
Pseudocode | No | The paper does not contain any structured pseudocode blocks or clearly labeled algorithm sections; it describes methods in prose and provides mathematical formulations.
Open Source Code | Yes | "Code is made available at https://github.com/amazon-research/anti-aliasing-transformer."
Open Datasets | Yes | "All models are trained for 300 epochs on the ImageNet [51] training set using 8 Tesla V100 GPUs." Reference [51] is "ImageNet: A large-scale hierarchical image database. In CVPR, 2009.", indicating a publicly available dataset.
Dataset Splits | Yes | "We report the performance of different anti-aliased variants on the ImageNet [51] validation set with 50K images." In Section 4.6, for a specific analysis: "The original 50K images are split into training split with 45K images and testing split with 5K images." (A hedged split sketch follows the table.)
Hardware Specification | Yes | All models are trained for 300 epochs on the ImageNet [51] training set using 8 Tesla V100 GPUs. Throughput is measured on a Tesla V100 GPU using [57]. (A hedged throughput-measurement sketch follows the table.)
Software Dependencies | No | The paper mentions software such as "PyTorch [58]" and refers to the "open-source implementation [50]" for Swin-T, but it does not specify exact version numbers for any software libraries or dependencies.
Experiment Setup | Yes | All training and testing parameters remain consistent with the open-source implementation [50]. Except for T2T-ViT [7], which is originally trained for 310 epochs, all models are trained for 300 epochs on the ImageNet [51] training set using 8 Tesla V100 GPUs. When our component is applied, the training configurations, such as the optimizers and data augmentations, are kept the same as the baselines.
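
The Aliasing-Reduction Module noted in the Research Type row is described as a plug-and-play filter, either a fixed Gaussian blur as in [10] or a learnable convolutional filter, inserted into vision-transformer stages. Below is a minimal PyTorch sketch of such a filtering module; the class name, the depthwise layout, the 3x3 kernel, and the token-to-map reshaping are illustrative assumptions rather than the paper's exact ARM design.

```python
import torch
import torch.nn as nn


class AliasingReductionFilter(nn.Module):
    """Hypothetical sketch of a plug-and-play low-pass filter for ViT features.

    Either a fixed Gaussian-like blur (as in [10]) or a learnable depthwise
    convolution is applied per channel to the 2D feature map of a stage.
    """

    def __init__(self, channels, learnable=False):
        super().__init__()
        # Depthwise 3x3 convolution: one filter per channel.
        self.filter = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels, bias=False)
        if not learnable:
            # Fixed binomial (Gaussian-like) blur kernel shared across channels.
            g = torch.tensor([[1., 2., 1.],
                              [2., 4., 2.],
                              [1., 2., 1.]]) / 16.0
            self.filter.weight.data.copy_(g.expand(channels, 1, 3, 3))
            self.filter.weight.requires_grad_(False)

    def forward(self, x, h, w):
        # x: (B, N, C) token sequence with N = h * w.
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)   # tokens -> 2D map
        feat = self.filter(feat)                        # low-pass filtering
        return feat.reshape(b, c, n).transpose(1, 2)    # 2D map -> tokens
```

The paper studies several placements of such filters inside transformer blocks; the sketch above shows only the filtering operation itself.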
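
For the Dataset Splits row, the 45K/5K split of the ImageNet validation set could be reproduced roughly as below; the data path and the random seed are placeholders, since the excerpt does not state how the split was drawn.

```python
import torch
from torch.utils.data import random_split
from torchvision.datasets import ImageFolder

# ImageNet validation set (50K images); the path is a placeholder.
val_set = ImageFolder("/path/to/imagenet/val")

# Split into 45K training and 5K testing images for the Section 4.6 analysis.
# The seed is a placeholder, not the paper's actual choice.
train_part, test_part = random_split(
    val_set, [45_000, 5_000],
    generator=torch.Generator().manual_seed(0))
```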
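
For the Hardware Specification row, throughput is measured on a single Tesla V100 following [57]. A generic measurement loop of the kind commonly used is sketched below; the batch size, warm-up passes, and iteration count are assumptions rather than the paper's exact protocol.

```python
import time
import torch


@torch.no_grad()
def measure_throughput(model, batch_size=64, image_size=224,
                       warmup=10, iters=30, device="cuda"):
    """Images per second on one GPU; protocol constants are assumptions."""
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(warmup):           # warm-up passes to stabilize timings
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```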