Chop Chop BERT: Visual Question Answering by Chopping VisualBERT’s Heads

Authors: Chenyu Gao, Qi Zhu, Peng Wang, Qi Wu

IJCAI 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | As shown in the interesting echelon shape of the result matrices, experiments reveal that different heads and layers are responsible for different question types, with higher-level layers activated by higher-level visual-reasoning questions. The experiments are based on VisualBERT, chosen for its general Transformer-style architecture without extra designs. |
| Researcher Affiliation | Academia | 1. School of Computer Science, Northwestern Polytechnical University, Xi'an, China; 2. School of Software, Northwestern Polytechnical University, Xi'an, China; 3. National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, China; 4. University of Adelaide, Australia |
| Pseudocode | No | The paper describes its methods but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | All experiments are conducted on the Task Driven Image Understanding Challenge (TDIUC) [Kafle and Kanan, 2017a] dataset, a large VQA dataset proposed to compensate for the bias in the distribution of question types in VQA 2.0 [Goyal et al., 2017]. |
| Dataset Splits | No | The paper mentions using the TDIUC dataset and fine-tuning, but does not explicitly provide training/validation/test splits with percentages or counts. |
| Hardware Specification | Yes | Experiments are conducted on 4 NVIDIA GeForce 2080Ti GPUs with a batch size of 480. |
| Software Dependencies | No | The paper mentions PyTorch but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | The model pre-trained on the COCO Caption [Chen et al., 2015] dataset is loaded, then fine-tuned on TDIUC with a learning rate of 5e-5. The maximal learning rate is 1e-3 and the batch size is 480. |
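The "chopping" in the paper's title refers to masking out individual attention heads and observing how accuracy on each question type changes. A minimal sketch of that idea, assuming per-head outputs have already been computed (all names and shapes here are hypothetical, not taken from the paper's code):

```python
def chop_heads(head_outputs, keep_mask):
    """Zero out the outputs of 'chopped' heads before they would be
    concatenated in a multi-head attention layer.

    head_outputs: one output vector (list of floats) per attention head
    keep_mask:    one 0/1 flag per head (0 = head is chopped)
    """
    return [
        [x * keep for x in out]          # masked heads contribute nothing
        for out, keep in zip(head_outputs, keep_mask)
    ]

# Example: 4 heads with 2-dim outputs; chop heads 2 and 4
outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
chopped = chop_heads(outputs, [1, 0, 1, 0])
```

Repeating this per head and per question type would yield a matrix of accuracy changes like the "echelon shape" result matrices the report quotes.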