Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Authors: Zhaokai Wang, Renda Bao, Qi Wu, Si Liu

AAAI 2021, pp. 2835-2843 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model outperforms state-of-the-art models on the TextCaps dataset: on the test set it improves CIDEr from 81.0 to 93.0, a gain of 12.0 over the current state of the art.
Researcher Affiliation | Collaboration | 1 Beihang University, Beijing, China; 2 Alibaba Group, Beijing, China; 3 University of Adelaide, Australia
Pseudocode | No | The paper describes its modules and equations but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our source code is publicly available at https://github.com/wzk1015/CNMT.
Open Datasets | Yes | We train our model on the TextCaps dataset, introduced by Sidorov et al. (2020), and evaluate its performance on the validation set and test set.
Dataset Splits | Yes | We train our model on the TextCaps dataset and evaluate its performance on the validation set and test set. Every 500 iterations we compute the BLEU-4 metric on the validation set and select the best model across all checkpoints.
Hardware Specification | Yes | The entire training takes approximately 12 hours on 4 RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions BERT-BASE (Devlin et al. 2018) and that the model is implemented with PyTorch, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For text detection, we use a pretrained CRAFT (Baek et al. 2019b) model and an ABCNet (Liu et al. 2020) model with a 0.7 confidence threshold. For OCR tokens that only appear in Rosetta results, we set a default confidence c_default = 0.90. We set the max OCR number N = 50... The dimension of the common semantic space is d = 768. The Generation Module uses 4 transformer layers with 12 attention heads. The maximum number of decoding steps is set to 30... The common-word ignoring threshold C of the repetition mask is set to 20. The model is trained on the TextCaps dataset for 12000 iterations. The initial learning rate is 1e-4, multiplied by 0.1 at 5000 and 7000 iterations respectively.
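
The Experiment Setup row above gathers the hyperparameters quoted from the paper. As a minimal sketch (the key names below are illustrative assumptions, not identifiers from the authors' released code at https://github.com/wzk1015/CNMT), they can be collected into a single Python configuration:

# Sketch of the hyperparameters quoted in the Experiment Setup row.
# Key names are illustrative assumptions, not taken from the CNMT repository.
config = {
    # OCR / text detection
    "detection_confidence_threshold": 0.7,   # CRAFT and ABCNet detections
    "rosetta_default_confidence": 0.90,      # c_default for Rosetta-only tokens
    "max_ocr_tokens": 50,                    # N
    # Model
    "hidden_dim": 768,                       # dimension d of the common semantic space
    "num_transformer_layers": 4,             # Generation Module
    "num_attention_heads": 12,
    "max_decoding_steps": 30,
    "repetition_mask_threshold": 20,         # common-word ignoring threshold C
    # Optimization
    "max_iterations": 12000,
    "base_lr": 1e-4,
    "lr_decay_iterations": (5000, 7000),     # lr multiplied by 0.1 at each
    "lr_decay_factor": 0.1,
    "eval_interval": 500,                    # BLEU-4 on the validation set
}

def lr_at(iteration: int) -> float:
    """Step-decay learning rate implied by the quoted schedule."""
    lr = config["base_lr"]
    for step in config["lr_decay_iterations"]:
        if iteration >= step:
            lr *= config["lr_decay_factor"]
    return lr

For example, lr_at(6000) returns 1e-5, consistent with the quoted schedule of multiplying the initial learning rate of 1e-4 by 0.1 at 5000 iterations and again at 7000 iterations.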