Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

Authors: Adyasha Maharana, Amita Kamath, Christopher Clark, Mohit Bansal, Aniruddha Kembhavi

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variables (Result, with supporting LLM response):

Research Type: Experimental
"As a solution, we introduce a benchmark dataset, CoCoCon, where we create contrast sets by modifying test instances for multiple tasks... We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks... To alleviate this issue, we propose a rank correlation-based auxiliary training objective... Data and code are available at https://adymaharana.github.io/cococon/. We evaluate two recent GPV models, Unified-IO (Lu et al., 2022) and OFA (Wang et al., 2022)... We show that cross-task inconsistency is a surprisingly significant phenomenon in these models... Our experiments show that continued training of models using this auxiliary consistency-based objective can lead to consistency improvements when evaluated on CoCoCon while preserving or improving the accuracy of the model on the original test sets."

Researcher Affiliation: Collaboration
"Adyasha Maharana (EMAIL), University of North Carolina, Chapel Hill; Amita Kamath (EMAIL), University of California, Los Angeles; Christopher Clark (EMAIL), Allen Institute for AI; Mohit Bansal (EMAIL), University of North Carolina, Chapel Hill; Aniruddha Kembhavi (EMAIL), Allen Institute for AI"

Pseudocode: Yes
"Algorithm 1: Cross-Task Consistency-based Training"

Open Source Code: Yes
"Data and code are available at https://adymaharana.github.io/cococon/."

Open Datasets: Yes
"As a solution, we introduce a benchmark dataset, CoCoCon, where we create contrast sets... Data and code are available at https://adymaharana.github.io/cococon/. The COCO dataset (Lin et al., 2014) contains annotations for many tasks in vision and language, which makes it very suitable for evaluating cross-task consistency in a multimodal model."

Dataset Splits: Yes
"CoCoCon is created from the validation splits of the four tasks, i.e., image captioning (anchor task), VQA (Antol et al., 2015; Goyal et al., 2017), localization, and text-to-image generation. The CoCoCon dataset contains 4,789 contrast sets for 1,500 samples from the COCO validation split, with an average of 3.2 contrast sets per sample."

Hardware Specification: No
"The paper does not explicitly state specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It refers to various models (Unified-IO, OFA, Kosmos-2, GILL) and training hyperparameters but lacks hardware specifications."

Software Dependencies: No
"The paper does not explicitly provide specific software dependencies with version numbers. It mentions using specific models like Unified-IO and OFA, and language models like T5, but without version details."

Experiment Setup: Yes
"We set λ = 0.25 and use a learning rate of 1e-6. Additional hyperparameters can be found in Appendix C." Table 5 (Hyperparameters for training OFACon):
- Proportion of ranking updates (γ): 0.5
- Weight coefficient of ranking loss (λ): 0.25
- Regularization strength of soft ranking: 1.0
- Learning rate: 1e-6
- Max. train epochs: 1
- Batch size: 2
- Warmup ratio: 0.1
- Label smoothing: 0.0
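To make the reported objective concrete: the paper trains with a task loss plus a rank-correlation-based consistency term weighted by λ = 0.25. The sketch below is a simplified illustration only, not the authors' implementation: it uses a hard Spearman rank correlation over per-sample confidence scores from two tasks, whereas the paper uses a differentiable soft-ranking formulation (with the regularization strength listed in Table 5). All function names here (`spearman`, `consistency_loss`, `training_loss`) are hypothetical.

```python
import math

def ranks(xs):
    """Rank of each element (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman(a, b):
    """Spearman rank correlation between two lists of confidence scores."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = math.sqrt(sum((x - ma) ** 2 for x in ra)) * \
          math.sqrt(sum((y - mb) ** 2 for y in rb))
    return num / den

def consistency_loss(scores_task_a, scores_task_b):
    """0 when the confidence rankings of the two tasks agree, 2 when reversed."""
    return 1.0 - spearman(scores_task_a, scores_task_b)

def training_loss(task_loss, scores_task_a, scores_task_b, lam=0.25):
    """Task loss plus the rank-based consistency penalty (λ = 0.25 per the paper)."""
    return task_loss + lam * consistency_loss(scores_task_a, scores_task_b)
```

Per Table 5, such a ranking update would be applied on only a proportion γ = 0.5 of training steps, with the remaining steps using the task loss alone.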