SeTformer Is What You Need for Vision and Language

Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Michael Felsberg

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In particular, with small and base-sized models, SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K. In object detection, SeTformer-base outperforms the FocalNet counterpart by +2.2 mAP, using 38% fewer parameters and 29% fewer FLOPs. In semantic segmentation, our base-size model surpasses NAT by +3.5 mIoU with 33% fewer parameters. SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark. These findings highlight SeTformer's applicability for vision and language tasks.
Researcher Affiliation | Academia | Pourya Shamsolmoali1, Masoumeh Zareapoor2*, Eric Granger3, Michael Felsberg4 — 1 School of Communication and Electronic Eng., East China Normal University; 2 Department of Electrical Eng., Shanghai Jiaotong University; 3 LIVIA, Department of Systems Eng., ETS Montreal; 4 Computer Vision Laboratory, Linköping University. mzarea@ieee.org
Pseudocode | No | The paper describes the architecture and method in text and diagrams but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | In the supplementary materials, we provided additional visualization results and the source code.
Open Datasets | Yes | We conduct experiments on both image and language domains, including ImageNet, COCO, and ADE20K, as well as GLUE, to demonstrate the impact of our model. (Deng et al. 2009), (Lin et al. 2014), (Zhou et al. 2019), (Wang et al. 2018), (Merity et al. 2017).
Dataset Splits | Yes | Models are trained on 118K training images and evaluated on the 5K validation set using Mask R-CNN (He et al. 2017). Following Swin's training setting, we utilize AdamW (Kingma and Ba 2014) for 300 iterations, with 20 for warm-up of the learning rate, followed by gradual decay, and then perform ten cool-down epochs.
Hardware Specification | Yes | We also note that the throughputs are measured on a V100 GPU.
Software Dependencies | No | The paper mentions using AdamW for optimization but does not provide specific version numbers for any software libraries or dependencies (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | Following Swin's training setting, we utilize AdamW (Kingma and Ba 2014) for 300 iterations, with 20 for warm-up of the learning rate, followed by gradual decay, and then perform ten cool-down epochs. We choose the number of references m as 750, and set ϵ to 0.3 and τ to 0.8. Methods are trained for 160K epochs using batch size 16, following (Liu et al. 2021). Our optimal result is achieved with m = 800, ϵ and τ set to 0.3 and 0.8, respectively.
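The reported recipe (AdamW, 20 warm-up steps, gradual decay, ten cool-down epochs) can be sketched as a learning-rate schedule. This is a hedged reconstruction, not the authors' released code: the base learning rate, the cosine shape of the "gradual decay", and the constant cool-down are all illustrative assumptions filled in from common Swin-style practice.

```python
import math

# Assumed values -- the paper gives the epoch counts but not the base LR
# or the exact decay curve; cosine decay is one common Swin-style choice.
BASE_LR = 1e-3
TOTAL, WARMUP, COOLDOWN = 300, 20, 10

def lr_at(epoch):
    """Learning rate at a given epoch under the sketched schedule."""
    if epoch < WARMUP:                      # linear warm-up over 20 epochs
        return BASE_LR * (epoch + 1) / WARMUP
    if epoch >= TOTAL:                      # cool-down: hold the final LR
        epoch = TOTAL - 1
    progress = (epoch - WARMUP) / (TOTAL - WARMUP)
    return BASE_LR * 0.5 * (1 + math.cos(math.pi * progress))

# Full schedule: warm-up + decay + ten cool-down epochs
schedule = [lr_at(e) for e in range(TOTAL + COOLDOWN)]
```

In an actual training loop this function would feed, e.g., `torch.optim.lr_scheduler.LambdaLR` wrapped around an AdamW optimizer; the reference-set size m and the ϵ/τ values quoted above are model hyperparameters, not part of the schedule.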