Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective
Authors: Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, Benoit Favre
Venue: AAAI 2022, pp. 11248–11257
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. Although the results confirm the ability of Vision-Language models to understand color at a multimodal level, the models seem to prefer relying on bias in text data for object position and size. |
| Researcher Affiliation | Academia | Emmanuelle Salin1, Badreddine Farah2, Stéphane Ayache1, Benoit Favre1 — 1Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France; 2École Sup Galilée, Université Sorbonne Paris Nord, France |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make all datasets and code available to replicate experiments. We make the set of monomodal and multimodal probing tasks, as well as all software developed for this study, available for further research1. 1https://github.com/ejsalin/vlm-probing |
| Open Datasets | Yes | We choose already existing language probing tasks and adapt them to a subset of 3,000 instances from Flickr30k (Young et al. 2014). We use the 102Flower dataset (Nilsback and Zisserman 2008). We build this object counting task on a subset of 3,000 instances of the MS-COCO dataset (Lin et al. 2014). We make all datasets and code available to replicate experiments. |
| Dataset Splits | No | Table 1 provides 'Total Instances' and 'Test Size' for each probing task, implicitly defining a train/test split. However, there is no explicit mention of a validation split percentage or size for the probing tasks. For fine-tuning, the paper only states that 'we finetune the models using authors' instructions' or 'use the available checkpoints', without detailing the splits. |
| Hardware Specification | No | The paper states only 'We trained the models on a cuda-capable GPU', which does not identify a specific GPU model. The acknowledgments mention 'HPC resources from GENCI-IDRIS' but do not provide further hardware details. |
| Software Dependencies | No | The paper states, 'We use the pre-trained models from Pytorch (Paszke et al. 2019) and Hugging Face (Wolf et al. 2020) for the experiments,' but does not provide specific version numbers for PyTorch or the Hugging Face library. |
| Experiment Setup | Yes | The probing model PM is a linear model trained over 30 epochs for the M, V and L-BShift tasks and 50 epochs for L-Tagging, with a learning rate of 0.001. MSE loss is used to train PM on V-Obj Count, with RMSE reported as the evaluation metric for that task, and cross-entropy loss is used for all other probing tasks, with accuracy as the metric. The results of each probing task are averaged over 5 runs. (A minimal sketch of this probing setup follows the table.) |
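
The Experiment Setup row describes a standard linear probing protocol: a linear probe over frozen vision-language representations, trained with a learning rate of 0.001 for 30 or 50 epochs, using MSE loss (evaluated with RMSE) for V-Obj Count and cross-entropy loss (evaluated with accuracy) for the other tasks, with results averaged over 5 runs. The sketch below illustrates that protocol in PyTorch under stated assumptions: the function name `train_probe`, the Adam optimizer, and the batch size are not specified in the paper, and the released code at https://github.com/ejsalin/vlm-probing remains the authoritative implementation.

```python
# Minimal sketch of the linear probing setup (assumptions: Adam optimizer,
# batch size 64; the paper specifies a linear probe, lr=0.001, and the
# MSE vs. cross-entropy split between regression and classification tasks).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_probe(features, targets, num_classes=None, epochs=30, lr=1e-3):
    """Train a linear probe on frozen vision-language representations.

    If num_classes is None, the task is treated as regression
    (e.g. V-Obj Count, trained with MSE and reported as RMSE);
    otherwise it is a classification task trained with cross-entropy
    and evaluated with accuracy.
    """
    in_dim = features.shape[1]
    regression = num_classes is None
    probe = nn.Linear(in_dim, 1 if regression else num_classes)
    criterion = nn.MSELoss() if regression else nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)  # optimizer choice is an assumption

    loader = DataLoader(TensorDataset(features, targets), batch_size=64, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            out = probe(x)
            loss = (criterion(out.squeeze(-1), y.float()) if regression
                    else criterion(out, y))
            loss.backward()
            optimizer.step()
    return probe
```

Per the paper, each probing result would then be averaged over 5 independent runs of such a training loop (50 epochs instead of 30 for L-Tagging), with the probe's input features extracted from the frozen pre-trained or fine-tuned vision-language model.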