Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
Authors: Zihao Zhu, Jing Yu, Yujing Wang, Yajing Sun, Yue Hu, Qi Wu
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We achieve a new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments. (Section 4, Experiments) Dataset: We evaluate Mucko on the FVQA dataset [Wang et al., 2018]. It consists of 2,190 images, 5,286 questions and a knowledge base of 193,449 facts. Facts are constructed by extracting top visual concepts in the dataset and querying these concepts in WebChild, ConceptNet and DBpedia. Evaluation Metrics: We follow the metrics in [Wang et al., 2018] to evaluate the performance. The top-1 and top-3 accuracy is calculated for each method, and the averaged accuracy over the 5 test splits is reported as the overall accuracy (see the evaluation sketch after this table). |
| Researcher Affiliation | Collaboration | Zihao Zhu (1,2), Jing Yu (1,2), Yujing Wang (3), Yajing Sun (1,2), Yue Hu (1,2) and Qi Wu (4). (1) Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; (2) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; (3) Microsoft Research Asia, Beijing, China; (4) University of Adelaide, Australia |
| Pseudocode | No | The paper describes the methodology using text and mathematical equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/astro-zihao/mucko. |
| Open Datasets | Yes | Dataset. We evaluate Mucko on the FVQA dataset [Wang et al., 2018]. |
| Dataset Splits | No | The paper states it evaluates on the FVQA dataset and reports results based on '5 test splits' and training details, but it does not provide explicit information about the specific training/validation/test dataset splits (e.g., percentages or sample counts) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using Faster-RCNN, GloVe embeddings, LSTM, and Adam optimizer but does not specify version numbers for any of the software dependencies. |
| Experiment Setup | Yes | Implementation Details: We select the top-10 dense captions according to their confidence. The max sentence length of dense captions and the questions is set to 20. The hidden state size of all the LSTM blocks is set to 512. We set a = 0.7 and b = 0.3 in the binary cross-entropy loss. Our model is trained by the Adam optimizer for 20 epochs, with a minibatch size of 64 and a dropout ratio of 0.5. A warm-up strategy is applied for 2 epochs with initial learning rate 1 × 10⁻³ and warm-up factor 0.2. Then we use a cosine annealing learning rate strategy with initial learning rate η_max = 1 × 10⁻³ and termination learning rate η_min = 3.6 × 10⁻⁴ for the remaining epochs (see the schedule sketch after this table). |
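
The evaluation protocol quoted under Research Type (top-1 / top-3 accuracy per method, averaged over the 5 FVQA test splits) is mechanical enough to sketch. Below is a minimal, hypothetical implementation: `split_scores` and `split_labels` are placeholder arrays, not outputs of the Mucko model, and the function names are illustrative.

```python
# Minimal sketch of the FVQA evaluation protocol: top-1 / top-3 accuracy
# per test split, averaged over the 5 splits. The answer-scoring model
# itself is not reproduced here; scores and labels are toy placeholders.
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """scores: (N, num_candidates); labels: (N,) index of the true answer."""
    topk = np.argsort(-scores, axis=1)[:, :k]     # k highest-scoring candidates
    hits = (topk == labels[:, None]).any(axis=1)  # is the true answer among top-k?
    return float(hits.mean())

def fvqa_overall_accuracy(split_scores, split_labels, k: int = 1) -> float:
    """Average top-k accuracy over the 5 FVQA test splits."""
    accs = [topk_accuracy(s, l, k) for s, l in zip(split_scores, split_labels)]
    return float(np.mean(accs))

# Toy usage: random scores for 5 splits of 100 questions, 20 candidates each.
rng = np.random.default_rng(0)
scores = [rng.normal(size=(100, 20)) for _ in range(5)]
labels = [rng.integers(0, 20, size=100) for _ in range(5)]
print(fvqa_overall_accuracy(scores, labels, k=1))
print(fvqa_overall_accuracy(scores, labels, k=3))
```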
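
The learning-rate schedule quoted under Experiment Setup can likewise be sketched. The paper does not spell out how the warm-up interpolates, so a linear ramp from warm-up factor × initial rate is assumed here; `lr_at_epoch` and its parameter names are illustrative, not the authors' code.

```python
# Hedged sketch of the quoted schedule: 2 warm-up epochs (initial LR 1e-3,
# warm-up factor 0.2), then cosine annealing from eta_max = 1e-3 down to
# eta_min = 3.6e-4 over the remaining 18 of 20 epochs.
import math

def lr_at_epoch(epoch: int, total_epochs: int = 20, warmup_epochs: int = 2,
                base_lr: float = 1e-3, warmup_factor: float = 0.2,
                eta_min: float = 3.6e-4) -> float:
    if epoch < warmup_epochs:
        # Linear warm-up from warmup_factor * base_lr to base_lr (assumption:
        # the paper only states the factor, not the interpolation).
        alpha = epoch / warmup_epochs
        return base_lr * (warmup_factor + (1.0 - warmup_factor) * alpha)
    # Cosine annealing from base_lr (eta_max) to eta_min over the rest.
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t))

for e in range(20):
    print(f"epoch {e:2d}: lr = {lr_at_epoch(e):.6f}")
```

With these settings the rate ramps from 2 × 10⁻⁴ up to 1 × 10⁻³ over the first two epochs, then decays smoothly toward 3.6 × 10⁻⁴ by epoch 19.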