Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Authors: Mustafa Shukor, Corentin Dancette, Alexandre Rame, Matthieu Cord
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | UnIVAL shows competitive performance to existing state-of-the-art approaches, across image and video-text tasks. The feature representations learned from image and video-text modalities, allows the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing their benefits in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are available at: https://github.com/mshukor/UnIVAL. ... In this section, we present the experimental results of UnIVAL following different setups; finetuning on downstream datasets and direct evaluation without finetuning (e.g. zero-shot). |
| Researcher Affiliation | Collaboration | Mustafa Shukor EMAIL Sorbonne University MLIA, ISIR, Paris, France Corentin Dancette EMAIL Sorbonne University MLIA, ISIR, Paris, France Alexandre Rame EMAIL Sorbonne University MLIA, ISIR, Paris, France Matthieu Cord EMAIL Sorbonne University, MLIA, ISIR, Paris, France Valeo.ai, Paris, France |
| Pseudocode | No | The paper describes the model architecture and training procedures in detail within the main text and appendices (e.g., Section 3, Section D, Section G), but it does not present any explicitly labeled pseudocode blocks or algorithms with structured steps. |
| Open Source Code | Yes | The model weights and code are available at: https://github.com/mshukor/UnIVAL. |
| Open Datasets | Yes | For pretraining tasks, we pretrain only on relatively small public datasets, such as image captioning (COCO (Lin et al., 2014), Visual Genome (VG) (Krishna et al., 2017b), SBU (Ordonez et al., 2011), CC3M (Sharma et al., 2018) and CC12M (Changpinyo et al., 2021) (only in the first stage)), VQA (VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), VG (Krishna et al., 2017b)), Visual Grounding (VGround) and referring expression comprehension (RefCOCO, RefCOCO+, RefCOCOg (Yu et al., 2016)), video captioning (WebVid2M (Bain et al., 2021)) and video question answering (WebVidQA (Yang et al., 2021a)). Note that we only use the training sets during pretraining. |
| Dataset Splits | Yes | Table 15: Downstream tasks and datasets. We show the size of different splits used in our work. ... COCO (Lin et al., 2014) Image-Text Image Captioning 113K/5K/5K ... VQAv2 (Goyal et al., 2017) Image-Text VQA 443K/214K/453K ... RefCOCO (Yu et al., 2016) Image-Text Visual Grounding 120K/6K/5K |
| Hardware Specification | Yes | This work was supported by HPC resources of CINES and GENCI. The authors would like to thank the staff of CINES for technical support in managing the Adastra GPU cluster, in particular; Jean-Christophe Penalva, Johanne Charpentier, Mathieu Cloirec, Jerome Castaings, Gérard Vernou, Bertrand Cirou and José Ricardo Kouakou. ... During inference (on 1 GPU AMD Instinct MI250X with batch size of 1), the model requires 1.2GB of GPU memory and 0.015 seconds to generate a COCO caption. |
| Software Dependencies | No | The paper mentions using BART-base for initialization (Lewis et al., 2020), the Adam optimizer, and specific pre-trained encoders (ResNet-101, 3D ResNext-101, PANN encoder). However, it does not specify version numbers for general software libraries or frameworks like Python, PyTorch, or TensorFlow, which are essential for replication. |
| Experiment Setup | Yes | Implementation details for pretraining. The architecture of the LM is a typical encoder-decoder transformer initialized by BART-base (Lewis et al., 2020) with few modifications... We use the Adam optimizer with weight decay 0.01 and a linear decay scheduler for the learning rate starting from 2e-4. All model parameters are pretrained in 2 stages; first we train only on image-text tasks for 150k steps and batch size 3200, then we add video-text tasks and continue training (after removing CC12M) for 90K steps with batch size 4K (2k for each modality). ... For image-text tasks, we keep the hyperparameters during finetuning close to those in OFA (Wang et al., 2022c). ... VQA. We finetune on the VQAv2 dataset and cast the task as text generation. The model is trained for 5 epochs with a batch size of 256 using the Adam optimizer. We use a learning rate of 1e-4 with linear decay and label smoothing of 0.1. The image resolution is increased to 480 and we use exponential moving average with 0.9999 decay. We use Trie-based search to constrain the generated answers to the top 3.1k answers. We freeze the encoder and decoder embeddings during finetuning. The question is passed to both the encoder and decoder as prompt. |
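The learning-rate schedule quoted in the Experiment Setup row (base LR 2e-4 for pretraining, 1e-4 for VQA finetuning, both with linear decay to zero) can be sketched as below. This is an illustrative reconstruction under stated assumptions, not code from the UnIVAL repository; the function name and the assumption that the decay reaches exactly zero at the final step are mine.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float = 2e-4) -> float:
    """Linearly decay the learning rate from base_lr to 0 over total_steps.

    Illustrative sketch of the schedule described in the paper
    (Adam with base LR 2e-4 and linear decay); the name and the
    exact endpoint behavior are assumptions, not the authors' code.
    """
    step = min(step, total_steps)  # clamp so LR never goes negative
    return base_lr * (1.0 - step / total_steps)

# First pretraining stage: image-text tasks for 150k steps at base LR 2e-4.
total_steps = 150_000
print(linear_decay_lr(0, total_steps))        # full base LR at the start
print(linear_decay_lr(75_000, total_steps))   # half the base LR midway
print(linear_decay_lr(150_000, total_steps))  # decayed to 0 at the end
```

The same helper would cover the VQA finetuning schedule by passing `base_lr=1e-4` and the step count implied by 5 epochs at batch size 256.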