Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Towards Robust Multi-Modal Reasoning via Model Selection
Authors: Xiangyan Liu, Rongxue Li, Wei Ji, Tao Lin
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the absence of suitable benchmarks, we create MS-GQA, a new dataset specifically designed to investigate the model selection challenge in multi-modal agents. Our experiments reveal that our framework enables dynamic model selection, considering both user inputs and subtask dependencies, thereby robustifying the overall reasoning process. Our code and benchmark: https://github.com/LINs-lab/M3. 4 EXPERIMENTS |
| Researcher Affiliation | Academia | Xiangyan Liu³, Rongxue Li²,¹, Wei Ji³, Tao Lin¹; EMAIL; EMAIL; EMAIL; EMAIL. ¹Westlake University, ²Zhejiang University, ³National University of Singapore |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and benchmark: https://github.com/LINs-lab/M3. |
| Open Datasets | Yes | As our side contribution, we introduce the first benchmark, MS-GQA (Model Selection in GQA (Hudson & Manning, 2019)), to explore the model selection methods on multi-modal reasoning scenarios. Our code and benchmark: https://github.com/LINs-lab/M3. |
| Dataset Splits | Yes | The dataset from MS-GQA is split randomly into training, validation, and test sets, with a 6 : 2 : 2 ratio. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions software components and models like 'CCE loss', 'NCF', 'METAGL', 'GAT', 'blip-base-vqa', and 'bert-base-uncased', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Specifically, we explored hidden sizes [16, 32, 64, 128], learning rates [1e-2, 5e-3, 1e-3, 5e-3, 1e-4], weight decays [0.01, 0.001, 0.0001], and optimizer options [AdamW, Adam, SGD]. A batch size of 64 is utilized, along with a StepLR scheduler with step size 100 and gamma 0.7. The learning rate is adjusted within [1e-2, 5e-3, 1e-3, 5e-3, 1e-4], with a majority of the experiments using 1e-3. The weight decay is set to 0, and the batch size is set to 128. |
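
The Dataset Splits and Experiment Setup rows quote a random 6 : 2 : 2 split and a step learning-rate schedule with step size 100 and gamma 0.7. The following is a minimal sketch of what those two settings imply, not the authors' implementation. The function names, the seed, the rounding rule for uneven splits, and the unit of a "step" (epoch or iteration) are all assumptions; the quoted text does not specify them.

```python
import random


def split_indices(n_samples, ratios=(6, 2, 2), seed=0):
    """Randomly partition range(n_samples) into train/val/test index lists.

    The seed and the floor-based rounding are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG; global state is untouched
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # any rounding remainder goes to test
    return train, val, test


def step_lr(base_lr, step, step_size=100, gamma=0.7):
    """Step decay: the rate is multiplied by gamma once every step_size steps."""
    return base_lr * gamma ** (step // step_size)


if __name__ == "__main__":
    train, val, test = split_indices(1000)
    print(len(train), len(val), len(test))
    for step in (0, 99, 100, 250):
        print(step, step_lr(1e-3, step))
```

With the most commonly reported base rate of 1e-3, this schedule holds the rate constant for the first 100 steps, then scales it by 0.7 at step 100 and by 0.49 from step 200 onward.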