MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

Authors: Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a series of experiments to validate the effectiveness of MoVie. By default, we use Adam (Kingma & Ba, 2015) optimizer, with batch size 128 and base learning rate 1e-4; momentum 0.9 and 0.98. We start training by linearly warming up learning rate from 2.5e-5 for 3 epochs (Yu et al., 2019). The rate is decayed by 0.1 after 10 epochs and we finish training after 13 epochs.
Researcher Affiliation | Industry | Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen, Facebook AI Research (FAIR)
Pseudocode | No | The paper provides architectural diagrams in Figure 2, but no explicit pseudocode or algorithm blocks.
Open Source Code | No | Code will be made available.
Open Datasets | Yes | Two datasets are used for counting with question queries. First is HowMany-QA (Trott et al., 2018)... Extending HowMany-QA, the TallyQA (Acharya et al., 2019) dataset... Results on COCO (Lin et al., 2014) are summarized in Tab. 3... Finally, to explore the capability of our model beyond counting, we evaluate MoVie on the CLEVR dataset (Johnson et al., 2017)... we also initiate an exploration of MoVie on the recent natural-image reasoning dataset, GQA (Hudson & Manning, 2019a).
Dataset Splits | Yes | First is HowMany-QA (Trott et al., 2018), where the train set questions are extracted from VQA 2.0 train and Visual Genome (VG) (Krishna et al., 2017). The val and test sets are taken from the VQA 2.0 val set. Extending HowMany-QA, the TallyQA (Acharya et al., 2019) dataset augments the train set by adding synthetic counting questions automatically generated from COCO annotations. They also split the test set into two parts: test-simple and test-complex... We train all models on VQA 2.0 train and report the breakdown scores on val.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models. It only mentions the general software environment: "We use Pytorch to implement our model on a modular framework for vision and language multimodal research from Facebook AI Research (FAIR)."
Software Dependencies | No | The paper mentions using "Pytorch" and a "modular framework" but does not specify exact version numbers for these or any other software dependencies required for reproducibility. (Appendix A)
Experiment Setup | Yes | By default, we use Adam (Kingma & Ba, 2015) optimizer, with batch size 128 and base learning rate 1e-4; momentum 0.9 and 0.98. We start training by linearly warming up learning rate from 2.5e-5 for 3 epochs (Yu et al., 2019). The rate is decayed by 0.1 after 10 epochs and we finish training after 13 epochs.
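Since the code is not released, the quoted experiment setup is the main handle for reproduction. The following is a minimal PyTorch sketch of that schedule: Adam with batch size 128, base learning rate 1e-4, a 3-epoch linear warmup from 2.5e-5, a 10x decay after epoch 10, and 13 epochs total. It assumes that "momentum 0.9 and 0.98" refers to Adam's (beta1, beta2) and that the warmup interpolates linearly between the two stated rates; the model and training loop are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of the quoted optimization schedule, assuming PyTorch.
# The model below is a hypothetical stand-in for the MoVie network.
import torch

model = torch.nn.Linear(2048, 1)  # placeholder module

base_lr = 1e-4        # base learning rate
warmup_lr = 2.5e-5    # warmup starting learning rate
warmup_epochs = 3     # linear warmup length
decay_epoch = 10      # decay the rate by 0.1 after this epoch
total_epochs = 13     # training length
batch_size = 128

# "momentum 0.9 and 0.98" interpreted here as Adam's (beta1, beta2).
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.98))

def lr_at_epoch(epoch: int) -> float:
    """Linear warmup from 2.5e-5 to 1e-4 over 3 epochs, constant until epoch 10, then x0.1."""
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    if epoch < decay_epoch:
        return base_lr
    return base_lr * 0.1

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ... iterate over the training set in batches of 128, compute the loss,
    #     then call loss.backward() and optimizer.step() here ...
```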