Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling

Authors: Liang Yin, Xudong Xie, Zhang Li, Xiang Bai, Yuliang Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. Notably, MSTAR marginally surpasses the previous state-of-the-art model by 6.4% in MAP on Total-Text while eliminating box annotation costs. Moreover, on the MQTR benchmark, MSTAR significantly outperforms the previous models by an average of 8.5%. The code and datasets are available at https://github.com/yingift/MSTAR.
Researcher Affiliation | Academia | Huazhong University of Science and Technology EMAIL
Pseudocode | No | The paper describes the overall architecture of MSTAR in Fig. 2 and provides mathematical formulations (e.g., Eq. 1, 2, 3, 4, 5) for its components. However, it does not contain explicit pseudocode or algorithm blocks with structured steps labeled as such.
Open Source Code | Yes | The code and datasets are available at https://github.com/yingift/MSTAR.
Open Datasets | Yes | Extensive experiments demonstrate the superiority of our method across seven public datasets and the MQTR dataset. ... To comprehensively evaluate the performance of models on multi-query scene text retrieval, we carefully build the Multi-Query Text Retrieval (MQTR) dataset. ... The construction of this dataset leverages well-annotated public datasets [37, 21, 13, 14, 7, 34, 24, 51], along with images obtained from Google Image Search. ... The code and datasets are available at https://github.com/yingift/MSTAR.
Dataset Splits | Yes | The MQTR dataset includes four sub-tasks: word, phrase, combined, and semantic retrieval. The word, phrase, and combined subsets each contain 5,000 images and the 200 most frequently occurring queries. The semantic subset consists of 1,000 images and 25 queries collected from the web. ... We collect a training dataset consisting of 95k images. First, 50k synthetic images with word transcriptions are leveraged from SynthText-900k [9]. Then 20k real images containing captions are collected from TextCap [34]. ... For the word retrieval experiment, the synthetic data Dsyn refers to 100K images randomly sampled from SynthText-900k, and the real data Dreal comes from MLT-5K. For the multi-query experiment, Dsyn includes 50K images randomly sampled from SynthText-900k and 25k images with phrase transcriptions with the synthesis engine. Dreal is the training set from the TextCap dataset.
Hardware Specification | Yes | The MSTAR model was trained on four NVIDIA A800 GPUs and evaluated on a single GPU, using the AdamW optimizer [25].
Software Dependencies | No | The MSTAR model was trained on four NVIDIA A800 GPUs and evaluated on a single GPU, using the AdamW optimizer [25]. The visual encoder ϕ is initialized from ViT-Base-512 of SigLIP [52]. The multi-modal encoder ψ is initialized from BLIP-2 [19]. While these mention specific models and an optimizer, they do not provide specific version numbers for libraries like PyTorch, Python, or CUDA, which are essential for full reproducibility.
Experiment Setup | Yes | The number of query tokens Ql is set to 64 with interpolation, which is consistent with the setting of the vanilla BLIP-2 in our comparison experiments. The MSTAR model was trained on four NVIDIA A800 GPUs and evaluated on a single GPU, using the AdamW optimizer [25]. A multi-stage training is adopted with progressive resolution increasing from 512 × 512, 640 × 640, to 800 × 800. For re-ranking, the top 2% of images are selected from the initial retrieval results. ... Table 13: Training details of our model. Phase 1, Phase 2, Phase 3, Phase 4; Image Resolution: 512, 640, 800, 800; Learning Rate: 1e-5, 1e-5, 5e-6, 5e-6; Warm-Up Steps: 100, 100, 0, 0; Freeze ViT: False, False, False, True; Precision of ViT: Float; Query of ψ: 64; Random Crop: True.
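
The training schedule and re-ranking rule quoted in the Experiment Setup row can be written down as a small configuration sketch. This is an illustration only: the names `TrainPhase`, `PHASES`, and `num_rerank_candidates` are hypothetical and are not taken from the MSTAR repository, and rounding the 2% cutoff up to a whole image is an assumption, since the quoted text does not say how fractional counts are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainPhase:
    """One training phase from the quoted Table 13 (hypothetical container)."""
    image_resolution: int  # side length of the square input, in pixels
    learning_rate: float
    warmup_steps: int
    freeze_vit: bool


# Values copied from the quoted Table 13; the data structure is an assumption.
PHASES = [
    TrainPhase(512, 1e-5, 100, False),
    TrainPhase(640, 1e-5, 100, False),
    TrainPhase(800, 5e-6, 0, False),
    TrainPhase(800, 5e-6, 0, True),
]

NUM_QUERY_TOKENS = 64  # "Query of ψ 64" in the quoted table
RERANK_PERCENT = 2     # "the top 2% of images are selected"


def num_rerank_candidates(gallery_size: int, percent: int = RERANK_PERCENT) -> int:
    """Number of top-ranked images handed to the re-ranking step.

    Uses integer ceiling division; rounding up is an assumption,
    not something the quoted text specifies.
    """
    if gallery_size <= 0:
        return 0
    return (gallery_size * percent + 99) // 100
```

Under this rounding assumption, a 5,000-image gallery such as the MQTR word subset would send 100 images to re-ranking, and the 1,000-image semantic subset would send 20.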