ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training
Authors: Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, Francesco Locatello
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 3, Empirical Evidence: In the following we compare ASIF to traditional multimodal models based on contrastive training, CLIP and LiT. We then take a closer look at the classification of a single image, unpacking the relative representations and following the classification algorithm step by step. |
| Researcher Affiliation | Academia | Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, Francesco Locatello. Affiliations: Sapienza Università di Roma, Dipartimento di Informatica; Institute of Science and Technology Austria (ISTA) |
| Pseudocode | No | The paper provides a detailed step-by-step 'ASIF recipe' for its procedure, structured as a numbered list. However, it is not explicitly labeled as 'Pseudocode' or an 'Algorithm' block (an illustrative sketch of this recipe, under stated assumptions, appears after this table). |
| Open Source Code | Yes | Correspondence to Antonio Norelli <norelli@di.uniroma1.it>. A demo sufficient to reproduce the main results in the paper within minutes, even from a smartphone, can be found here: https://github.com/noranta4/ASIF |
| Open Datasets | Yes | We employed the first 1.6M entries of the Conceptual Captions dataset (CC12M) [49] as our multimodal dataset. ... For our primary experiment, we utilized vision transformers as image encoders, pretrained either in a supervised manner (DEIT base, [44]) or in an unsupervised manner (DINO VITs8, [45]), on ImageNet 1k [46] and 21k [47] respectively. ... We assessed the quality of our ASIF multimodal model by comparing its zero-shot classification performance against CLIP and LiT on four datasets: CIFAR100, ImageNet, ImageNet-v2, and PETS [46, 50-52]; see Table 1. |
| Dataset Splits | Yes | We used a subset of the ImageNet validation set to tune the two hyperparameters of ASIF, which were then used on the other datasets. ... As a further experiment, we randomly selected 100 images from the EuroSAT dataset and incorporated them into our ASIF training set, raising the total to 1,500,100 image-text pairs and leaving 26,900 images for testing. |
| Hardware Specification | Yes | To optimize performance on a single Tesla T4 GPU, we limited our analysis to the initial 1.6M pairs. |
| Software Dependencies | No | The paper mentions using specific models and architectures such as 'SentenceT transformer', 'DEIT base', and 'DINO VITs8', but it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or the implementations of these models. |
| Experiment Setup | Yes | We tuned k and p on the ImageNet validation set; in both cases we used k = 800 and p = 8. ... Besides the pivotal choice of the ground-truth multimodal pairs, the number of non-zero elements k and the exponent p are the salient hyperparameters to consider when deploying an ASIF model. |
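
For context, the following is a minimal NumPy sketch of the zero-shot classification step described by the paper's 'ASIF recipe': embeddings from frozen unimodal encoders are re-expressed as relative representations (cosine similarities to the N coupled image-text anchor pairs), sparsified to the k largest entries, raised to the exponent p, and then matched across modalities. Function names, the non-negative clipping of similarities, and the normalization details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relative_repr(z, anchors, k=800, p=8):
    """Relative representation of one unit-normalized embedding `z` with respect
    to the N unit-normalized anchor embeddings (one modality of the coupled
    image-text pairs). Keeps only the k largest cosine similarities and raises
    them to the power p, as in the ASIF recipe."""
    sims = anchors @ z                          # cosine similarities to the N anchors
    keep = np.argpartition(sims, -k)[-k:]       # indices of the k largest entries
    sparse = np.zeros_like(sims)
    sparse[keep] = np.clip(sims[keep], 0.0, None) ** p   # assumption: clip negatives before exponentiation
    norm = np.linalg.norm(sparse)
    return sparse / norm if norm > 0 else sparse

def asif_zero_shot(img_emb, prompt_embs, img_anchors, txt_anchors, k=800, p=8):
    """Zero-shot classification: compare the image's relative representation
    (built against the image anchors) with each candidate prompt's relative
    representation (built against the paired text anchors)."""
    r_img = relative_repr(img_emb, img_anchors, k, p)
    r_txt = np.stack([relative_repr(t, txt_anchors, k, p) for t in prompt_embs])
    scores = r_txt @ r_img                      # dot products of normalized relative vectors
    return int(np.argmax(scores)), scores
```

With the hyperparameters reported in the table (k = 800, p = 8) and the 1.6M CC12M pairs as anchors, the only trained components are the frozen unimodal encoders that would produce `img_emb`, `prompt_embs`, and the anchor embeddings; no multimodal training is involved, which is the paper's central claim.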