CoAtFormer: Vision Transformer with Composite Attention

Authors: Zhiyong Chang, Mingjun Yin, Yan Wang

IJCAI 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments show our CoAtFormer achieves state-of-the-art results on various different tasks." |
| Researcher Affiliation | Collaboration | Zhiyong Chang¹, Mingjun Yin², Yan Wang³ (¹Peking University, ²The University of Melbourne, ³Zuoyebang) |
| Pseudocode | No | The paper includes mathematical formulations (Equations 1-19) and architectural diagrams (Figures 2 and 3), but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not include any explicit statement about releasing source code for the methodology, nor a link to a code repository. |
| Open Datasets | Yes | "We conduct experiments on ImageNet-1K [Deng et al., 2009] classification, COCO [Lin et al., 2014] object detection and instance segmentation, and ADE20K [Zhou et al., 2017] semantic segmentation." |
| Dataset Splits | Yes | "We conduct experiments on ImageNet-1K [Deng et al., 2009] classification, COCO [Lin et al., 2014] object detection and instance segmentation, and ADE20K [Zhou et al., 2017] semantic segmentation. For fair comparison, we follow the same training strategies as previous works [Touvron et al., 2020; Liu et al., 2021]." |
| Hardware Specification | No | The paper reports computational costs (e.g., FLOPs) and parameter counts but does not specify any hardware details such as GPU/CPU models, memory, or cloud computing instances used for the experiments. |
| Software Dependencies | No | The paper mentions software components such as the AdamW optimizer, Mask R-CNN, Cascade Mask R-CNN, UperNet, and the GELU activation, but it does not provide version numbers for any of these dependencies. |
| Experiment Setup | Yes | "For fair comparison, we follow the same training strategies as previous works [Touvron et al., 2020; Liu et al., 2021]. Specifically, we train all our models for 300 epochs with the input size of 224×224. We employ the AdamW optimizer with weight decay of 0.05. The default batch size and initial learning rate are set to 1024 and 0.001." |
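
The Experiment Setup row quotes enough hyperparameters to reconstruct the training configuration. The following is a minimal PyTorch sketch of that recipe, not the authors' code: the stand-in model is a placeholder (CoAtFormer's source is not released), and the cosine learning-rate schedule is an assumption carried over from the DeiT/Swin recipes the paper says it follows, since the quoted excerpt does not name a schedule.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hyperparameters quoted in the Experiment Setup row.
EPOCHS = 300        # "train all our models for 300 epochs"
INPUT_SIZE = 224    # "input size of 224×224"
BATCH_SIZE = 1024   # global batch size; not exercised in this sketch
BASE_LR = 1e-3      # "initial learning rate ... 0.001"
WEIGHT_DECAY = 0.05 # "AdamW optimizer with weight decay of 0.05"

# Stand-in model: a trivial ImageNet-1K classifier used purely to make
# the sketch runnable; it is NOT the CoAtFormer architecture.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * INPUT_SIZE * INPUT_SIZE, 1000),  # 1000 ImageNet-1K classes
)

optimizer = torch.optim.AdamW(
    model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY
)

# Assumption: cosine decay over the full training run, as in the
# DeiT/Swin baselines the paper follows; the excerpt does not name one.
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)
```

A global batch size of 1024 is normally realized with data-parallel training across several accelerators, which is consistent with the paper reporting FLOPs and parameters but no specific hardware.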