KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning

Authors: Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao

AAAI 2024

Each reproducibility variable below is listed with its result, followed by the supporting LLM response.
Research Type: Experimental
Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness.
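The quoted gaps are absolute accuracy-point differences rather than relative improvements; a quick arithmetic check in Python:

    # Absolute accuracy gaps behind the quoted "18%" and "10%" margins.
    print(round(93.87 - 75.17, 2))  # 18.70 points over GPT-3.5
    print(round(93.87 - 83.99, 2))  # 9.88 points over GPT-4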
Researcher Affiliation: Industry
Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao
Samsung R&D Institute India, Bangalore
{d.mondal, suraj.modi, subha.darshi, rituraj.s, g.sudhakar}@samsung.com
Pseudocode: Yes
Algorithm 1: KAM-CoT Reasoning
Input: language features X^rat_lang, image features X_img, and graph features X_kg
Output: rationale r, answer a
 1: Construct input X = {X^rat_lang, X_img, X_kg}
 2: r <- F_rat(X)
 3: Concatenate r to X^rat_lang to make X^ans_lang <- [X^rat_lang; r]
 4: Construct new input X' = {X^ans_lang, X_img, X_kg}
 5: a <- F_ans(X')
 6: procedure F(X)
 7:   Get the encoded representations H_lang, H_img, and H_kg
 8:   Obtain the attended feature representations H^attn_img and H^attn_kg
 9:   Fuse these representations with H_lang to obtain H_fuse
10:   Input H_fuse to the decoder to get the target Y
11:   return Y
12: end procedure
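A minimal PyTorch sketch of procedure F(X) and its two-stage use, assuming single cross-attention blocks for the image and graph streams and a learned gated fusion; the dimensions, the gating choice, and all names below are illustrative assumptions, not the authors' released implementation:

    import torch
    import torch.nn as nn

    class KamCotStage(nn.Module):
        """One pass of procedure F(X): cross-attend, fuse, decode."""

        def __init__(self, d=768, vocab=32128):
            super().__init__()
            # Language queries attend over image / graph keys and values.
            self.attn_img = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
            self.attn_kg = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
            self.gate = nn.Linear(3 * d, 2)          # mixing weights for the two streams
            self.decoder_head = nn.Linear(d, vocab)  # stand-in for the T5 decoder

        def forward(self, h_lang, h_img, h_kg):
            # H^attn_img and H^attn_kg (lines 7-8 of Algorithm 1).
            a_img, _ = self.attn_img(h_lang, h_img, h_img)
            a_kg, _ = self.attn_kg(h_lang, h_kg, h_kg)
            # H_fuse (line 9): gated combination with H_lang.
            g = torch.softmax(self.gate(torch.cat([h_lang, a_img, a_kg], dim=-1)), dim=-1)
            h_fuse = h_lang + g[..., :1] * a_img + g[..., 1:] * a_kg
            # Decode H_fuse into the target Y (lines 10-11).
            return self.decoder_head(h_fuse)

    # Two-stage flow: F_rat generates the rationale r; F_ans then consumes
    # language features re-encoded from [X^rat_lang; r].
    f_rat, f_ans = KamCotStage(), KamCotStage()
    h_lang = torch.randn(1, 512, 768)  # encoded language features H_lang
    h_img = torch.randn(1, 100, 768)   # e.g., projected DETR object features
    h_kg = torch.randn(1, 50, 768)     # e.g., projected GNN node embeddings
    rationale_logits = f_rat(h_lang, h_img, h_kg)
    answer_logits = f_ans(h_lang, h_img, h_kg)  # h_lang would now encode [X^rat_lang; r]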
Open Source Code: No
The paper does not provide an explicit statement about, or a link to, source code for the described methodology.
Open Datasets: Yes
We evaluate our method on the ScienceQA benchmark (Lu et al. 2022).
Dataset Splits: Yes
ScienceQA provides us with an in-house training, dev and test split containing 12726, 4241 and 4241 samples respectively.
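A quick way to sanity-check those split sizes, assuming the ScienceQA copy hosted on the Hugging Face Hub under "derek-thomas/ScienceQA" (the Hub path is an assumption; the paper cites Lu et al. 2022 directly):

    from datasets import load_dataset

    # The Hub path below is an assumption; the official release lives at scienceqa.github.io.
    ds = load_dataset("derek-thomas/ScienceQA")
    for split, expected in [("train", 12726), ("validation", 4241), ("test", 4241)]:
        print(split, len(ds[split]), "expected:", expected)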
Hardware Specification: Yes
All our experiments are run on a single NVIDIA A100 40G GPU.
Software Dependencies: No
The paper mentions various software components and models such as PyTorch Geometric (Fey and Lenssen 2019), T5-Base (Raffel et al. 2020), FLAN-T5-Base (Chung et al. 2022), CLIP (Radford et al. 2021), DETR (Carion et al. 2020), and ViT-GPT2, citing their associated papers. However, it does not provide specific version numbers for the software libraries or frameworks used (e.g., 'PyTorch 1.9' or 'HuggingFace Transformers 4.x.x').
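Because no versions are given, a reproduction attempt would have to pin its own; a minimal sketch for recording the installed stack (the PyPI names are assumptions mapped from the components the paper cites):

    import importlib.metadata as im

    # PyPI distribution names are assumptions; adjust to the environment at hand.
    for pkg in ["torch", "torch_geometric", "transformers"]:
        try:
            print(f"{pkg}=={im.version(pkg)}")
        except im.PackageNotFoundError:
            print(f"{pkg}: not installed")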
Experiment Setup: Yes
We train our models for 20 epochs, and also evaluate them after each epoch with ScienceQA's dev split. We use a learning rate of 5e-5, a batch size of 1, a maximum input length of 512 tokens, and maximum output lengths of 512 and 64 tokens for rationale and answer generation, respectively.
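One plausible way to wire those hyperparameters into a Hugging Face Seq2SeqTrainingArguments object (the mapping is ours; the authors' training script is not released):

    from transformers import Seq2SeqTrainingArguments

    args = Seq2SeqTrainingArguments(
        output_dir="kam-cot-rationale",    # hypothetical output path
        num_train_epochs=20,               # 20 epochs, evaluated after each
        learning_rate=5e-5,
        per_device_train_batch_size=1,     # batch size 1 on a single A100 40G
        eval_strategy="epoch",             # named evaluation_strategy in older transformers
        predict_with_generate=True,
        generation_max_length=512,         # 512 for rationales; 64 for answer generation
    )
    # The 512-token maximum input length is enforced at tokenization time.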