Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
Authors: Adyasha Maharana, Amita Kamath, Christopher Clark, Mohit Bansal, Aniruddha Kembhavi
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As a solution, we introduce a benchmark dataset, CoCoCon, where we create contrast sets by modifying test instances for multiple tasks... We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks... To alleviate this issue, we propose a rank correlation-based auxiliary training objective... Data and code are available at https://adymaharana.github.io/cococon/. We evaluate two recent GPV models, Unified-IO (Lu et al., 2022) and OFA (Wang et al., 2022)... We show that cross-task inconsistency is a surprisingly significant phenomenon in these models... Our experiments show that continued training of models using this auxiliary consistency-based objective can lead to consistency improvements when evaluated on CoCoCon while preserving or improving the accuracy of the model on the original test sets. |
| Researcher Affiliation | Collaboration | Adyasha Maharana (University of North Carolina, Chapel Hill); Amita Kamath (University of California, Los Angeles); Christopher Clark (Allen Institute for AI); Mohit Bansal (University of North Carolina, Chapel Hill); Aniruddha Kembhavi (Allen Institute for AI) |
| Pseudocode | Yes | Algorithm 1 Cross-Task Consistency-based Training |
| Open Source Code | Yes | Data and code are available at https://adymaharana.github.io/cococon/. |
| Open Datasets | Yes | As a solution, we introduce a benchmark dataset, CoCoCon, where we create contrast sets... Data and code are available at https://adymaharana.github.io/cococon/. The COCO dataset (Lin et al., 2014) contains annotations for many tasks in vision and language, which makes it very suitable for evaluating cross-task consistency in a multimodal model. |
| Dataset Splits | Yes | CoCoCon is created from the validation splits of the four tasks, i.e., image captioning (anchor task), VQA (Antol et al., 2015; Goyal et al., 2017), localization, and text-to-image generation. The CoCoCon dataset contains 4,789 contrast sets for 1,500 samples from the COCO validation split, with an average of 3.2 contrast sets per sample. |
| Hardware Specification | No | The paper does not explicitly state specific hardware details such as GPU models, CPU types, or memory used for running the experiments. It refers to various models (Unified-IO, OFA, Kosmos-2, GILL) and training hyperparameters but lacks hardware specifications. |
| Software Dependencies | No | The paper does not explicitly provide specific software dependencies with version numbers. It mentions using specific models like Unified-IO and OFA, and language models like T5, but without version details. |
| Experiment Setup | Yes | We set λ = 0.25 and use a learning rate of 1e-6. Additional hyperparameters can be found in Appendix C. Table 5: Hyperparameters for training OFACon — proportion of ranking updates (γ): 0.5; weight coefficient of ranking loss (λ): 0.25; regularization strength of soft ranking: 1.0; learning rate: 1e-6; max. train epochs: 1; batch size: 2; warmup ratio: 0.1; label smoothing: 0.0 |