Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Authors: Zhaokai Wang, Renda Bao, Qi Wu, Si Liu

AAAI 2021, pp. 2835-2843 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model outperforms state-of-the-art models on the TextCaps dataset: on the test set it improves CIDEr from 81.0 to 93.0, a gain of 12.0 over the current state of the art.
Researcher Affiliation | Collaboration | 1 Beihang University, Beijing, China; 2 Alibaba Group, Beijing, China; 3 University of Adelaide, Australia
Pseudocode | No | The paper describes its modules and equations but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our source code is publicly available at https://github.com/wzk1015/CNMT.
Open Datasets | Yes | We train our model on the TextCaps dataset, introduced by Sidorov et al. (2020), and evaluate its performance on the validation set and test set.
Dataset Splits | Yes | We train our model on the TextCaps dataset and evaluate its performance on the validation set and test set. Every 500 iterations we compute the BLEU-4 metric on the validation set and select the best model across all checkpoints.
Hardware Specification | Yes | The entire training takes approximately 12 hours on 4 RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions BERT-BASE (Devlin et al. 2018) and that the model is implemented with PyTorch, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For text detection, we use a pretrained CRAFT (Baek et al. 2019b) model and an ABCNet (Liu et al. 2020) model with a 0.7 confidence threshold. For OCR tokens that only appear in Rosetta results, we set a default confidence c_default = 0.90. We set the max OCR number N = 50... The dimension of the common semantic space is d = 768. The Generation Module uses 4 transformer layers with 12 attention heads. The maximum number of decoding steps is set to 30... The common-word ignoring threshold C of the repetition mask is set to 20. The model is trained on the TextCaps dataset for 12000 iterations. The initial learning rate is 1e-4, multiplied by 0.1 at 5000 and 7000 iterations respectively.
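
The Experiment Setup row above gathers the hyperparameters quoted from the paper. As a minimal sketch (the key names below are illustrative assumptions, not identifiers from the authors' released code at https://github.com/wzk1015/CNMT), they can be collected into a single Python configuration:

# Sketch of the hyperparameters quoted in the Experiment Setup row.
# Key names are illustrative assumptions, not taken from the CNMT repository.
config = {
    # OCR / text detection
    "detection_confidence_threshold": 0.7,   # CRAFT and ABCNet detections
    "rosetta_default_confidence": 0.90,      # c_default for Rosetta-only tokens
    "max_ocr_tokens": 50,                    # N
    # Model
    "hidden_dim": 768,                       # dimension d of the common semantic space
    "num_transformer_layers": 4,             # Generation Module
    "num_attention_heads": 12,
    "max_decoding_steps": 30,
    "repetition_mask_threshold": 20,         # common-word ignoring threshold C
    # Optimization
    "max_iterations": 12000,
    "base_lr": 1e-4,
    "lr_decay_iterations": (5000, 7000),     # lr multiplied by 0.1 at each
    "lr_decay_factor": 0.1,
    "eval_interval": 500,                    # BLEU-4 on the validation set
}

def lr_at(iteration: int) -> float:
    """Step-decay learning rate implied by the quoted schedule."""
    lr = config["base_lr"]
    for step in config["lr_decay_iterations"]:
        if iteration >= step:
            lr *= config["lr_decay_factor"]
    return lr

For example, lr_at(6000) returns 1e-5, consistent with the quoted schedule of multiplying the initial learning rate of 1e-4 by 0.1 at 5000 iterations and again at 7000 iterations.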