Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

UMP-Net: Uncertainty-Aware Mixture of Prompts Network for Efficient Instruction Tuning

Authors: Fatemeh Daneshfar, Abdulhady Abas Abdullah, Moloud Abdar, Pietro Liò

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluated UMP-Net on a range of benchmarks including ScienceQA, COCO Caption, and various zero-shot multi-modal tasks. The results show strong performance: an average accuracy of 88.41% on ScienceQA and a CIDEr score of 158.3 on COCO Caption, surpassing models such as LLaVA, LLaMA-Adapter, and LLaMA-Excitor. These findings suggest that UMP-Net offers both improved multi-modal capability and computational efficiency. Further ablations demonstrate that UMP-Net's conformal prediction module provides robust uncertainty estimates under noise and domain shifts, outperforming Bayesian alternatives in coverage guarantees with minimal overhead.
Researcher Affiliation Academia Fatemeh Daneshfar EMAIL Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran; Abdulhady Abas Abdullah EMAIL Artificial Intelligence and Innovation Centre, University of Kurdistan Hewler, Erbil, Iraq; Moloud Abdar EMAIL CHIRP, Child Health Research Centre, The University of Queensland, Brisbane, Australia; Pietro Liò EMAIL Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Pseudocode Yes Algorithm 1 UMP-Net Algorithm for LLaMA Adaptation
Open Source Code Yes The code of our proposed model is available here: https://github.com/abdulhadyabas2/UMP-Net Uncertainty.
Open Datasets Yes We evaluated UMP-Net on a range of benchmarks including ScienceQA, COCO Caption, and various zero-shot multi-modal tasks. The results show strong performance: an average accuracy of 88.41% on ScienceQA and a CIDEr score of 158.3 on COCO Caption, surpassing models such as LLaVA, LLaMA-Adapter, and LLaMA-Excitor. We test the performance of UMP-Net through extensive benchmarking on systems such as ScienceQA Lu et al. (2022a), COCO Caption Chen et al. (2015), and a spectrum of zero-shot multi-modal tasks. We evaluated our model on the COCO Caption dataset Chen et al. (2015), which comprises 0.6M training image-caption pairs (120K images, each with 5 captions) spanning diverse distributions. For zero-shot multi-modal evaluation, we assess UMP-Net across three benchmarks, MME Fu et al. (2023), MMBench Liu et al. (2023c), and LVLM-eHub Xu et al. (2023), covering diverse visual question-answering (VQA) tasks.
Dataset Splits Yes Following the Stanford Alpaca Taori et al. (2023a), we employ a dataset of 52K instruction-following examples for training purposes. We evaluated our model on the COCO Caption dataset Chen et al. (2015), which comprises 0.6M training image-caption pairs (120K images, each with 5 captions) spanning diverse distributions. We evaluate UMP-Net on the ScienceQA dataset Lu et al. (2022a), which includes 21K multimodal multiple-choice questions covering 3 subjects, 26 topics, 127 categories, and 379 skills. To assess the reliability and robustness of Conformal Prediction (CP) under domain shift and noisy inputs, we evaluated UMP-Net's calibration on a ScienceQA subset (100 samples, 20% OOD, 20% noisy, p = 0.2), targeting 90% coverage (1 − α = 0.90) Vovk et al. (2005); Zou et al. (2024).
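The calibration procedure referenced above (split conformal prediction targeting 1 − α = 0.90 coverage) can be sketched in a few lines. This is a minimal illustrative sketch of the standard split-CP recipe, not the authors' implementation; the function names and the nonconformity score (1 − predicted probability) are assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha=0.10):
    """Split conformal prediction: pick the nonconformity threshold from a
    held-out calibration set so prediction sets attain (1 - alpha) marginal
    coverage (alpha=0.10 matches the paper's 1 - alpha = 0.90 target)."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)), clamped to n.
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

def prediction_set(class_probs, threshold):
    """Include every class whose nonconformity score (1 - probability)
    is at or below the calibrated threshold."""
    return [c for c, p in class_probs.items() if 1.0 - p <= threshold]

# With 100 calibration samples (as in the robustness experiment) and
# alpha = 0.10, the threshold is the 91st smallest calibration score.
```

Under this construction, noisier or shifted inputs yield larger nonconformity scores and therefore larger prediction sets, which is how coverage is preserved under the 20% OOD / 20% noisy conditions described above.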
Hardware Specification Yes The UMP-Net model is fine-tuned using 2 RTX 4090 GPUs over 4 epochs. Each of these measurements was taken during a single training run with mixed precision (FP16), a batch size of 8, and an NVIDIA RTX 4090 (24 GB VRAM). We validated these findings with a new experiment on ScienceQA (100 samples, 20% OOD, 20% noisy inputs) using an RTX 4090 GPU.

Table 25: Inference cost (mean ± sd) for UMP-Net vs. LLaMA-Adapter. Latency % = (UMP-Net − LLaMA-Adapter) / LLaMA-Adapter.

| GPU  | Task | B | L    | Img Res | Latency (ms) | Tokens/s    | Images/s   | VRAM (GB) | FLOPs (T) | Latency % |
|------|------|---|------|---------|--------------|-------------|------------|-----------|-----------|-----------|
| 4090 | Text | 1 | 512  | –       | 1300 ± 40    | 101.2 ± 3.0 | –          | 14.4      | 0.81      | +4.8%     |
| 4090 | VL   | 1 | 512  | 336     | 1740 ± 55    | 78.3 ± 2.1  | 11.0 ± 0.4 | 16.7      | 1.21      | +10.1%    |
| A100 | Text | 8 | 1024 | –       | 1930 ± 60    | 505 ± 12    | –          | 28.1      | 4.48      | +4.3%     |
| A100 | VL   | 8 | 1024 | 336     | 2600 ± 85    | 380 ± 11    | 39.0 ± 1.2 | 33.0      | 6.20      | +10.2%    |
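The Latency % column in Table 25 is a plain relative-difference metric. A minimal sketch for readers reproducing the column (the function name and the baseline figure are illustrative; the paper reports only the UMP-Net latency and the percentage, not the baseline value used here):

```python
def latency_overhead_pct(ump_net_ms, baseline_ms):
    """Percent latency overhead of UMP-Net relative to the baseline:
    (UMP-Net - baseline) / baseline * 100, matching Table 25's definition."""
    return (ump_net_ms - baseline_ms) / baseline_ms * 100.0

# Illustrative only: a hypothetical baseline of 1240 ms against the reported
# 1300 ms UMP-Net latency gives roughly +4.8% overhead.
```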
Software Dependencies No The paper mentions "PyTorch's torch.utils.benchmark" but does not specify a version number for it or any other software components used in the implementation or experimentation.
Experiment Setup Yes The UMP-Net model is fine-tuned using 2 RTX 4090 GPUs over 4 epochs. We configure the training with two warmup epochs, a batch size of 8, a learning rate of 0.009, and a weight decay of 0.02. By default, we utilize the LLaMA-Adapter Zhang et al. (2024) pre-trained for the LLaMA2 7B version, and the foundation pre-trained LLaMA3 model with 8B parameters and N = 32 transformer layers. The prompt length is set to d_p = 40, and the adaptation prompts are integrated into the final M = 30 layers of the model.