SeTformer Is What You Need for Vision and Language

Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Michael Felsberg

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In particular, with small and base-sized models, SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K. In object detection, SeTformer-base outperforms the FocalNet counterpart by +2.2 mAP, using 38% fewer parameters and 29% fewer FLOPs. In semantic segmentation, our base-size model surpasses NAT by +3.5 mIoU with 33% fewer parameters. SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark. These findings highlight SeTformer's applicability for vision and language tasks.
Researcher Affiliation | Academia | Pourya Shamsolmoali1, Masoumeh Zareapoor2*, Eric Granger3, Michael Felsberg4 — 1 School of Communication and Electronic Eng., East China Normal University; 2 Department of Electrical Eng., Shanghai Jiaotong University; 3 LIVIA, Department of Systems Eng., ETS Montreal; 4 Computer Vision Laboratory, Linköping University. mzarea@ieee.org
Pseudocode | No | The paper describes the architecture and method in text and diagrams but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | In the supplementary materials, we provided additional visualization results and the source code.
Open Datasets | Yes | We conduct experiments on both image and language domains, including ImageNet, COCO, and ADE20K, as well as GLUE, to demonstrate the impact of our model. (Deng et al. 2009), (Lin et al. 2014), (Zhou et al. 2019), (Wang et al. 2018), (Merity et al. 2017).
Dataset Splits | Yes | Models are trained on 118K training images and evaluated on the 5K validation set using Mask R-CNN (He et al. 2017). Following Swin's training setting, we utilize AdamW (Kingma and Ba 2014) for 300 iterations, with 20 for warm-up of the learning rate, followed by gradual decay, and then perform ten cool-down epochs.
Hardware Specification | Yes | We also note that the throughputs are measured on a V100 GPU.
Software Dependencies | No | The paper mentions using AdamW for optimization but does not provide specific version numbers for any software libraries or dependencies (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | Following Swin's training setting, we utilize AdamW (Kingma and Ba 2014) for 300 iterations, with 20 for warm-up of the learning rate, followed by gradual decay, and then perform ten cool-down epochs. We choose the number of references m as 750, and set ϵ to 0.3 and τ to 0.8. Methods are trained for 160K epochs using batch size 16, following (Liu et al. 2021). Our optimal result is achieved with m = 800, ϵ and τ set to 0.3 and 0.8, respectively.
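The reported recipe (AdamW, 20 warm-up steps, gradual decay, ten cool-down epochs) can be sketched as a learning-rate schedule. This is a hedged reconstruction, not the authors' released code: the base learning rate, the cosine shape of the "gradual decay", and the constant cool-down are all illustrative assumptions filled in from common Swin-style practice.

```python
import math

# Assumed values -- the paper gives the epoch counts but not the base LR
# or the exact decay curve; cosine decay is one common Swin-style choice.
BASE_LR = 1e-3
TOTAL, WARMUP, COOLDOWN = 300, 20, 10

def lr_at(epoch):
    """Learning rate at a given epoch under the sketched schedule."""
    if epoch < WARMUP:                      # linear warm-up over 20 epochs
        return BASE_LR * (epoch + 1) / WARMUP
    if epoch >= TOTAL:                      # cool-down: hold the final LR
        epoch = TOTAL - 1
    progress = (epoch - WARMUP) / (TOTAL - WARMUP)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * progress))

# Full schedule: warm-up + decay + ten cool-down epochs
schedule = [lr_at(e) for e in range(TOTAL + COOLDOWN)]
```

In an actual training loop this function would feed, e.g., `torch.optim.lr_scheduler.LambdaLR` wrapped around an AdamW optimizer; the reference-set size m and the ϵ/τ values quoted above are model hyperparameters, not part of the schedule.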