Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models
Authors: Liqi He, Zuchao Li, Xiantao Cai, Ping Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on the ScienceQA benchmark, which contains questions that require reasoning based on provided text and images. The results show that our proposed latent space learning is effective in generating useful chain of thought (CoT) and inferring correct answers. We achieved new state-of-the-art results on the ScienceQA benchmark with only about 1 billion parameters, outperforming the current SOTA baseline by 6.06% (base) and 1.67% (large), respectively, and the strong ChatGPT system by 18.18% with less than 1/100th of the parameters, demonstrating the effectiveness of our approach. |
| Researcher Affiliation | Academia | Liqi He¹, Zuchao Li¹*, Xiantao Cai¹, Ping Wang²,³. ¹National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China; ²Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China; ³School of Information Management, Wuhan University, Wuhan 430072, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be released at https://github.com/shimurenhlq/DPMM-COT. |
| Open Datasets | Yes | To assess CoT on LLMs, we followed the approach of MM-CoT (Zhang et al. 2023b) and used the Science Question Answering (ScienceQA) (Lu et al. 2022) dataset. In addition, we also conducted experiments on the Multi30K multi-modal translation dataset (Elliott et al. 2016) and followed the work of IKD-MMT (Peng, Zeng, and Zhao 2022). |
| Dataset Splits | No | The paper mentions using the ScienceQA and Multi30K datasets but does not explicitly state training/validation/test splits with percentages, sample counts, or a splitting methodology in the main text (the official ScienceQA release does carry split labels; see the split-counting sketch after this table). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments (e.g., specific GPU/CPU models, memory, or cloud instances). |
| Software Dependencies | No | The paper mentions the "T5 encoder-decoder architecture (Raffel et al. 2020)" and "mT5-large (Xue et al. 2021)" as models, but does not list software dependencies with version numbers (e.g., Python or PyTorch versions). |
| Experiment Setup | No | The paper describes a "two-stage framework consisting of two procedures: rationale generation and answer inference. Both stages shared the same model architecture, namely the T5 encoder-decoder architecture (Raffel et al. 2020)." However, it does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings (a minimal two-stage sketch follows this table). |
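On the dataset-splits point: although the paper does not restate the partition, the official ScienceQA release (the upstream lupantech/ScienceQA repository, not this paper) ships a `problems.json` whose records carry a `split` field, so the train/val/test counts can be recovered directly from the data. A minimal sketch, assuming a local copy of that file at the upstream layout's default path:

```python
import json
from collections import Counter

# Path assumes the official ScienceQA data layout (lupantech/ScienceQA);
# adjust to wherever problems.json was downloaded.
with open("data/scienceqa/problems.json") as f:
    problems = json.load(f)

# Each problem record carries a "split" key: "train", "val", or "test".
split_counts = Counter(p["split"] for p in problems.values())

total = sum(split_counts.values())
for split, count in sorted(split_counts.items()):
    print(f"{split:>6}: {count:6d} ({count / total:.1%})")
```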
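To make the two-stage framework concrete, below is a minimal sketch of rationale generation followed by answer inference with a T5 encoder-decoder via Hugging Face `transformers`. The prompt templates and the use of a plain text-only `t5-base` checkpoint are illustrative assumptions; the paper's models additionally fuse image features in latent space, which this sketch omits.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# A vanilla text-only T5 stands in for the paper's multi-modal variant;
# per the paper, both stages share the same encoder-decoder architecture.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

question = "Which property do these objects share? Options: (A) hard (B) soft"

# Stage 1: rationale generation -- produce a chain of thought for the question.
rationale = generate(f"Generate a rationale: {question}")

# Stage 2: answer inference -- condition on the question plus the rationale.
answer = generate(f"Answer the question: {question} Rationale: {rationale}")
print(answer)
```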