On the Connection between Local Attention and Dynamic Depth-wise Convolution
Authors: Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, Jingdong Wang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The connection between local attention and dynamic depth-wise convolution is verified empirically through an ablation study on weight sharing and dynamic weight computation in the Local Vision Transformer and (dynamic) depth-wise convolution. We empirically observe that the models based on depth-wise convolution and the dynamic variants with lower computation complexity perform on par with or slightly better than Swin Transformer, an instance of Local Vision Transformer, for ImageNet classification, COCO object detection, and ADE20K semantic segmentation. (A minimal sketch of the static and dynamic depth-wise convolution operations follows the table.) |
| Researcher Affiliation | Collaboration | TKLNDST, CS, Nankai University; Peking University; Microsoft Research Asia; Baidu Inc. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/Atten4Vis/DemystifyLocalViT. |
| Open Datasets | Yes | The ImageNet-1K recognition dataset (Deng et al., 2009) contains 1.28M training images and 50K validation images across 1,000 classes. We use exactly the same training setting as Swin Transformer (Liu et al., 2021b). The AdamW optimizer (Loshchilov & Hutter, 2019) is adopted for 300 epochs, with a cosine decay learning rate scheduler and 20 epochs of linear warm-up. The weight decay is 0.05, and the initial learning rate is 0.001. The augmentation and regularization strategies include RandAugment (Cubuk et al., 2020), Mixup (Zhang et al., 2018a), CutMix (Yun et al., 2019), stochastic depth (Huang et al., 2016), etc. |
| Dataset Splits | Yes | The ImageNet-1K recognition dataset (Deng et al., 2009) contains 1.28M training images and 50K validation images across 1,000 classes. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware used for running its experiments (e.g., specific GPU or CPU models). |
| Software Dependencies | No | The paper mentions software components like AdamW, Cascade Mask R-CNN, UPerNet, and MMSegmentation, but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | The AdamW optimizer (Loshchilov & Hutter, 2019) is adopted for 300 epochs, with a cosine decay learning rate scheduler and 20 epochs of linear warm-up. The weight decay is 0.05, and the initial learning rate is 0.001. The stochastic depth rate is set to 0.2 and 0.5 for the tiny and base models, respectively. (A hedged sketch of this optimization schedule follows the table.) |
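To make the ablated design axes concrete, below is a minimal sketch, assuming PyTorch, that contrasts a static depth-wise convolution (one kernel per channel, shared across positions and samples) with a dynamic variant whose per-channel kernels are predicted from the input, which is the sense in which the paper relates depth-wise convolution to local attention. The module names and the global-average-pooling kernel predictor are illustrative assumptions, not code from the released DemystifyLocalViT repository.

```python
# Hedged sketch (assumes PyTorch): static vs. dynamic depth-wise convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthWiseConv(nn.Module):
    """Static depth-wise convolution: one kernel per channel, shared across positions."""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels, bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        return self.conv(x)


class DynamicDepthWiseConv(nn.Module):
    """Dynamic variant (illustrative): per-channel kernels are predicted from a
    global descriptor of the input, so the aggregation weights depend on the
    content, analogous to the attention weights in local attention."""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.kernel_size = kernel_size
        self.predict = nn.Linear(channels, channels * kernel_size * kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Predict one kernel per channel from globally average-pooled features.
        desc = x.mean(dim=(2, 3))                                     # (B, C)
        kernels = self.predict(desc).view(b * c, 1, self.kernel_size, self.kernel_size)
        # Apply per-sample, per-channel kernels via a grouped convolution.
        out = F.conv2d(
            x.reshape(1, b * c, h, w), kernels,
            padding=self.kernel_size // 2, groups=b * c,
        )
        return out.view(b, c, h, w)
```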
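The reported optimization settings (AdamW, initial learning rate 0.001, weight decay 0.05, 300 epochs, cosine decay with 20 warm-up epochs) can be sketched in plain PyTorch as below. This is a hedged illustration only: the released code follows the Swin Transformer recipe, and the dummy model, the warm-up start factor, and the epoch-level scheduler stepping are assumptions; stochastic depth (drop-path 0.2 / 0.5 for tiny / base) would live inside the actual model definition.

```python
# Hedged sketch of the reported optimization schedule (not the authors' script).
import torch
import torch.nn as nn

EPOCHS, WARMUP_EPOCHS = 300, 20
BASE_LR, WEIGHT_DECAY = 1e-3, 0.05

# Dummy stand-in for the tiny/base backbone used in the paper.
model = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, padding=3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(96, 1000),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

# 20 epochs of linear warm-up, then cosine decay for the remaining epochs.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, end_factor=1.0, total_iters=WARMUP_EPOCHS)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[WARMUP_EPOCHS])

for epoch in range(EPOCHS):
    # ... one training epoch over ImageNet-1K (forward, loss, backward,
    # optimizer.step()) would go here ...
    scheduler.step()
```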