Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Authors: Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental findings show KAM-CoT outperforms the state-of-the-art methods. On the ScienceQA dataset, we achieve an average accuracy of 93.87%, surpassing GPT-3.5 (75.17%) by 18% and GPT-4 (83.99%) by 10%. Remarkably, KAM-CoT achieves these results with only 280M trainable parameters at a time, demonstrating its cost-efficiency and effectiveness. |
| Researcher Affiliation | Industry | Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, Godawari Sudhakar Rao Samsung R&D Institute India Bangalore EMAIL |
| Pseudocode | Yes | Algorithm 1: KAM-CoT Reasoning. Input: Language features $X_{lang}^{rat}$, Image features $X_{img}$, and Graph features $X_{kg}$. Output: Rationale $r$, Answer $a$. 1: Construct input $X = \{X_{lang}^{rat}, X_{img}, X_{kg}\}$ 2: $r \leftarrow F_{rat}(X)$ 3: Concatenate $r$ to $X_{lang}^{rat}$ to make $X_{lang}^{ans} \leftarrow [X_{lang}^{rat}; r]$ 4: Construct new input $X' = \{X_{lang}^{ans}, X_{img}, X_{kg}\}$ 5: $a \leftarrow F_{ans}(X')$ 6: procedure $F(X)$ 7: Get the encoded representations $H_{lang}$, $H_{img}$, and $H_{kg}$ 8: Obtain the feature representations $H_{img}^{attn}$ and $H_{kg}^{attn}$ 9: Fuse these representations with $H_{lang}$ to obtain $H_{fuse}$ 10: Input $H_{fuse}$ to the decoder to get the target $Y$ 11: return $Y$ 12: end procedure |
| Open Source Code | No | The paper does not provide an explicit statement or link for the source code of the methodology described in this paper. |
| Open Datasets | Yes | We evaluate our method on the ScienceQA benchmark (Lu et al. 2022). |
| Dataset Splits | Yes | ScienceQA provides us with an in-house training, dev and test split containing 12726, 4241 and 4241 samples respectively. |
| Hardware Specification | Yes | All our experiments are run on a single NVIDIA A100 40G GPU. |
| Software Dependencies | No | The paper mentions various software components and models such as PyTorch Geometric (Fey and Lenssen 2019), T5-Base (Raffel et al. 2020), FLAN-T5-Base (Chung et al. 2022), CLIP (Radford et al. 2021), DETR (Carion et al. 2020), and ViT-GPT2, citing their associated papers. However, it does not provide specific version numbers for the software libraries or frameworks used (e.g., 'PyTorch 1.9' or 'HuggingFace Transformers 4.x.x'). |
| Experiment Setup | Yes | We train our models for 20 epochs, and also evaluate them after each, with ScienceQA's dev split. We use a learning rate of 5e-5 and batch-size of 1, a maximum input length of 512 tokens, and maximum output length of 512 and 64 tokens for rationale and answer generation respectively. |
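
The two-stage control flow quoted in the Pseudocode row (steps 1 to 5 of Algorithm 1) can be illustrated with a minimal Python sketch. This is not the authors' implementation, since the paper does not release source code. The function name `kam_cot_reasoning`, the dictionary keys, the space separator used for concatenation, and the two toy stand-in models are all assumptions made for illustration. The inner procedure F (encoding, attention, fusion, decoding) is deliberately left as an opaque callable.

```python
from typing import Any, Callable, Dict, Tuple

# A "model" here is any callable mapping an input dict to generated text.
# In the paper F_rat and F_ans are encoder-decoder models; here they are opaque.
Model = Callable[[Dict[str, Any]], str]


def kam_cot_reasoning(
    x_lang_rat: str,
    x_img: Any,
    x_kg: Any,
    f_rat: Model,
    f_ans: Model,
) -> Tuple[str, str]:
    """Two-stage flow of Algorithm 1: generate a rationale, then an answer."""
    # Step 1: construct the input from language, image, and graph features.
    x = {"lang": x_lang_rat, "img": x_img, "kg": x_kg}
    # Step 2: the rationale model produces the rationale r.
    r = f_rat(x)
    # Step 3: concatenate r to the language input (space separator is an assumption).
    x_lang_ans = f"{x_lang_rat} {r}"
    # Step 4: construct the new input; image and graph features are reused unchanged.
    x_prime = {"lang": x_lang_ans, "img": x_img, "kg": x_kg}
    # Step 5: the answer model produces the answer a.
    a = f_ans(x_prime)
    return r, a


# Hypothetical stand-ins for the trained models, used only to exercise the flow.
def toy_rationale_model(x: Dict[str, Any]) -> str:
    return "Rationale: plants need light to grow."


def toy_answer_model(x: Dict[str, Any]) -> str:
    # Answers (B) only if the rationale reached the second stage.
    return "(B)" if "Rationale:" in x["lang"] else "(A)"


rationale, answer = kam_cot_reasoning(
    "Question: What do plants need? Options: (A) darkness (B) light",
    x_img=None,
    x_kg=None,
    f_rat=toy_rationale_model,
    f_ans=toy_answer_model,
)
```

Passing the models in as callables keeps the sketch independent of any particular framework; in a real setup they would wrap trained sequence-to-sequence models with their own tokenization and feature extraction.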