KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Authors: Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness. |
| Researcher Affiliation | Industry | Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao Samsung R&D Institute India Bangalore {d.mondal, suraj.modi, subha.darshi, rituraj.s, g.sudhakar}@samsung.com |
| Pseudocode | Yes | Algorithm 1: KAM-CoT Reasoning. Input: language features X_lang^rat, image features X_img, graph features X_kg. Output: rationale r, answer a. 1: construct input X = {X_lang^rat, X_img, X_kg}; 2: r ← F_rat(X); 3: concatenate r to X_lang^rat to form X_lang^ans ← [X_lang^rat; r]; 4: construct new input X' = {X_lang^ans, X_img, X_kg}; 5: a ← F_ans(X'). 6: procedure F(X); 7: get the encoded representations H_lang, H_img, and H_kg; 8: obtain the attended feature representations H_img^attn and H_kg^attn; 9: fuse these with H_lang to obtain H_fuse; 10: feed H_fuse into the decoder to get the target Y; 11: return Y; 12: end procedure. (A hedged Python sketch of this two-stage procedure follows the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link for the source code of the methodology described in this paper. |
| Open Datasets | Yes | We evaluate our method on the ScienceQA benchmark (Lu et al. 2022). |
| Dataset Splits | Yes | ScienceQA provides us with an in-house training, dev and test split containing 12,726, 4,241 and 4,241 samples respectively. |
| Hardware Specification | Yes | All our experiments are run on a single NVIDIA A100 40G GPU. |
| Software Dependencies | No | The paper mentions various software components and models such as PyTorch Geometric (Fey and Lenssen 2019), T5-Base (Raffel et al. 2020), FLAN-T5-Base (Chung et al. 2022), CLIP (Radford et al. 2021), DETR (Carion et al. 2020), and ViT-GPT2, citing their associated papers. However, it does not provide specific version numbers for the software libraries or frameworks used (e.g., 'PyTorch 1.9' or 'HuggingFace Transformers 4.x.x'). |
| Experiment Setup | Yes | We train our models for 20 epochs and evaluate after each epoch on ScienceQA's dev split. We use a learning rate of 5e-5, a batch size of 1, a maximum input length of 512 tokens, and maximum output lengths of 512 and 64 tokens for rationale and answer generation respectively. (A hedged configuration sketch follows the table, after the algorithm sketch.) |
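
To make Algorithm 1 concrete, here is a minimal PyTorch sketch of the two-stage reasoning loop. It assumes pre-computed per-modality features, uses stand-in modules (`nn.MultiheadAttention` for the language-queried attention, a gated sum for the fusion step, and a GRU in place of the paper's T5 decoder), and passes the rationale stage's decoder states directly into the answer stage, whereas the paper generates rationale text and re-encodes it. The names `KamCoTStage` and `kam_cot_reasoning` are hypothetical; the authors' code is not released.

```python
import torch
import torch.nn as nn


class KamCoTStage(nn.Module):
    """One stage F(X) of Algorithm 1 (lines 6-12): attend, fuse, decode."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        # Line 8: language-queried attention over image and knowledge-graph features.
        self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.kg_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Line 9: gated fusion of the three streams into H_fuse.
        self.gate = nn.Linear(3 * d_model, 3)
        # Line 10: stand-in decoder (the paper uses a T5-style decoder).
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, h_lang, h_img, h_kg):
        h_img_attn, _ = self.img_attn(h_lang, h_img, h_img)  # H_img^attn
        h_kg_attn, _ = self.kg_attn(h_lang, h_kg, h_kg)      # H_kg^attn
        gate = torch.softmax(
            self.gate(torch.cat([h_lang, h_img_attn, h_kg_attn], dim=-1)), dim=-1
        )
        h_fuse = (
            gate[..., 0:1] * h_lang
            + gate[..., 1:2] * h_img_attn
            + gate[..., 2:3] * h_kg_attn
        )
        y, _ = self.decoder(h_fuse)  # lines 10-11: decode H_fuse into the target Y
        return y


def kam_cot_reasoning(f_rat, f_ans, x_lang, x_img, x_kg):
    """Two-stage loop of Algorithm 1 (lines 1-5): rationale first, then answer."""
    r = f_rat(x_lang, x_img, x_kg)              # line 2: r = F_rat(X)
    x_lang_ans = torch.cat([x_lang, r], dim=1)  # line 3: X_lang^ans = [X_lang^rat; r]
    a = f_ans(x_lang_ans, x_img, x_kg)          # line 5: a = F_ans(X')
    return r, a


if __name__ == "__main__":
    batch, seq, d = 2, 16, 768
    x_lang, x_img, x_kg = (torch.randn(batch, seq, d) for _ in range(3))
    rationale, answer = kam_cot_reasoning(
        KamCoTStage(d), KamCoTStage(d), x_lang, x_img, x_kg
    )
    print(rationale.shape, answer.shape)  # torch.Size([2, 16, 768]) torch.Size([2, 32, 768])
```

The separate `f_rat` and `f_ans` instances mirror the paper's two fine-tuning stages, where the answer model sees the language input augmented with the generated rationale.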
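The reported training settings can also be captured in a small configuration; this is a sketch using a plain Python dict, and the key names are illustrative rather than taken from the authors' (unreleased) training script.

```python
# Training settings as reported in the paper; key names are illustrative only.
KAM_COT_TRAIN_CONFIG = {
    "epochs": 20,                        # evaluated on ScienceQA's dev split after each epoch
    "learning_rate": 5e-5,
    "batch_size": 1,                     # runs on a single NVIDIA A100 40G GPU
    "max_input_length": 512,             # tokens
    "max_output_length_rationale": 512,  # tokens
    "max_output_length_answer": 64,      # tokens
}
```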