MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond
Authors: Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a series of experiments to validate the effectiveness of MoVie. By default, we use Adam (Kingma & Ba, 2015) optimizer, with batch size 128 and base learning rate 1e-4; momentum 0.9 and 0.98. We start training by linearly warming up learning rate from 2.5e-5 for 3 epochs (Yu et al., 2019). The rate is decayed by 0.1 after 10 epochs and we finish training after 13 epochs. |
| Researcher Affiliation | Industry | Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen Facebook AI Research (FAIR) |
| Pseudocode | No | The paper provides architectural diagrams in Figure 2, but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | Code will be made available. |
| Open Datasets | Yes | Two datasets are used for counting with question queries. First is HowMany-QA (Trott et al., 2018)... Extending HowMany-QA, the TallyQA (Acharya et al., 2019) dataset... Results on COCO (Lin et al., 2014) are summarized in Tab. 3... Finally, to explore the capability of our model beyond counting, we evaluate MoVie on the CLEVR dataset (Johnson et al., 2017)... we also initiate an exploration of MoVie on the recent natural-image reasoning dataset, GQA (Hudson & Manning, 2019a). |
| Dataset Splits | Yes | First is HowMany-QA (Trott et al., 2018) where the train set questions are extracted from VQA 2.0 train and Visual Genome (VG) (Krishna et al., 2017). The val and test sets are taken from VQA 2.0 val set. Extending HowMany-QA, the TallyQA (Acharya et al., 2019) dataset augments the train set by adding synthetic counting questions automatically generated from COCO annotations. They also split the test set into two parts: test-simple and test-complex... We train all models on VQA 2.0 train and report the breakdown scores on val. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models. It only mentions the general software environment: "We use Pytorch to implement our model on a modular framework for vision and language multimodal research from Facebook AI Research (FAIR)." |
| Software Dependencies | No | The paper mentions using "Pytorch" and a "modular framework" but does not specify exact version numbers for these or any other software dependencies required for reproducibility. (Appendix A) |
| Experiment Setup | Yes | By default, we use Adam (Kingma & Ba, 2015) optimizer, with batch size 128 and base learning rate 1e-4; momentum 0.9 and 0.98. We start training by linearly warming up learning rate from 2.5e-5 for 3 epochs (Yu et al., 2019). The rate is decayed by 0.1 after 10 epochs and we finish training after 13 epochs. (A sketch of this schedule is given below the table.) |
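
The quoted training recipe maps onto standard PyTorch building blocks. The snippet below is a minimal, hedged sketch of that schedule: Adam with betas (0.9, 0.98), base learning rate 1e-4, linear warmup from 2.5e-5 over 3 epochs, a 0.1 decay after epoch 10, and 13 epochs total. The model placeholder, constant names, and the exact warmup interpolation are assumptions, not the authors' released implementation.

```python
# Hedged sketch of the training schedule described in the paper's setup.
# The model, data loading, and loss are placeholders; only the optimizer
# and learning-rate schedule follow the quoted hyperparameters.
import torch

BASE_LR = 1e-4            # base learning rate
WARMUP_START_LR = 2.5e-5  # warmup starting learning rate
WARMUP_EPOCHS = 3         # linear warmup duration
DECAY_EPOCH = 10          # LR multiplied by 0.1 after this epoch
TOTAL_EPOCHS = 13
BATCH_SIZE = 128

model = torch.nn.Linear(2048, 100)  # stand-in for the MoVie model
optimizer = torch.optim.Adam(model.parameters(), lr=BASE_LR, betas=(0.9, 0.98))

def lr_factor(epoch: int) -> float:
    """Multiplier applied to BASE_LR at a given (0-indexed) epoch."""
    if epoch < WARMUP_EPOCHS:
        # Linear warmup from WARMUP_START_LR toward BASE_LR
        # (the paper only states the start value and duration, so the
        # exact interpolation is an assumption).
        start = WARMUP_START_LR / BASE_LR
        return start + (1.0 - start) * epoch / WARMUP_EPOCHS
    if epoch >= DECAY_EPOCH:
        return 0.1
    return 1.0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(TOTAL_EPOCHS):
    # A real loop would iterate a DataLoader with batch_size=BATCH_SIZE,
    # compute the loss, then call loss.backward() and optimizer.step() here.
    scheduler.step()  # advance the epoch-level learning-rate schedule
```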