ASIF: Coupled Data Turns Unimodal Models to Multimodal without Training

Authors: Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, Francesco Locatello

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evidence: In the following we compare ASIF to traditional multimodal models based on contrastive training, CLIP and LiT. We then take a closer look at the classification of a single image, unpacking the relative representations and following the classification algorithm step by step.
Researcher Affiliation | Academia | Antonio Norelli, Marco Fumero, Valentino Maiorca, Luca Moschella, Emanuele Rodolà, Francesco Locatello; Sapienza Università di Roma, Dipartimento di Informatica; Institute of Science and Technology Austria (ISTA)
Pseudocode | No | The paper provides a detailed step-by-step 'ASIF recipe' for its procedure, structured as a numbered list, but it is not explicitly labeled as 'Pseudocode' or as an 'Algorithm' block. (A Python sketch of the recipe is given after this table.)
Open Source Code | Yes | Correspondence to Antonio Norelli <norelli@di.uniroma1.it>. A demo sufficient to reproduce the main results in the paper within minutes, even from a smartphone, can be found here: https://github.com/noranta4/ASIF
Open Datasets | Yes | We employed the first 1.6M entries of the Conceptual Captions dataset (CC12M, [49]) as our multimodal dataset. ... For our primary experiment, we utilized vision transformers as image encoders, pretrained either in a supervised manner (DeiT base, [44]) or in an unsupervised manner (DINO ViT-S/8, [45]), on ImageNet-1k [46] and ImageNet-21k [47] respectively. ... We assessed the quality of our ASIF multimodal model by comparing its zero-shot classification performance against CLIP and LiT on four datasets: CIFAR100, ImageNet, ImageNetV2, and PETS [46, 50–52]; see Table 1.
Dataset Splits | Yes | We used a subset of the ImageNet validation set to tune the two hyperparameters of ASIF, which were then used on the other datasets. ... As a further experiment, we randomly selected 100 images from the EuroSAT dataset and incorporated them into our ASIF training set, raising the total to 1,500,100 image-text pairs and leaving 26,900 images for testing.
Hardware Specification | Yes | To optimize performance on a single Tesla T4 GPU, we limited our analysis to the initial 1.6M pairs.
Software Dependencies | No | The paper mentions specific models and architectures such as the SentenceT transformer, DeiT base, and DINO ViT-S/8, but it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or the implementations of these models.
Experiment Setup | Yes | We tuned k and p on the ImageNet validation set; in both cases we used k = 800 and p = 8. ... Besides the pivotal choice of the ground-truth multimodal pairs, the number of non-zero elements k and the exponent p are the salient hyperparameters to consider when deploying an ASIF model. (See the grid-search sketch after this table.)
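The 'ASIF recipe' referenced in the Pseudocode row amounts to comparing sparsified relative representations: an image and a caption are each described by their cosine similarities to the respective halves of the anchor image-text pairs, all but the k largest similarities are zeroed, and the survivors are raised to the exponent p. The following is a minimal NumPy sketch of that procedure, not the authors' implementation (the linked repository has the real one); the array layout and the helper names `sparsify`, `relative_rep`, and `asif_classify` are illustrative assumptions, while the top-k sparsification, the exponent p, and the anchor-pair construction come from the paper.

```python
# Minimal sketch of the ASIF zero-shot classification recipe (NumPy only).
# Assumptions: embeddings and anchor rows are L2-normalized, so dot products
# are cosine similarities. Helper names are hypothetical, not the paper's code.
import numpy as np

def sparsify(sims: np.ndarray, k: int, p: int) -> np.ndarray:
    """Keep the k largest similarities, raise them to the power p, zero the rest."""
    out = np.zeros_like(sims)
    top = np.argpartition(sims, -k)[-k:]              # indices of the k largest entries
    out[top] = np.sign(sims[top]) * np.abs(sims[top]) ** p  # sign-preserving power
    norm = np.linalg.norm(out)
    return out / norm if norm > 0 else out

def relative_rep(z: np.ndarray, anchors: np.ndarray, k: int, p: int) -> np.ndarray:
    """Represent embedding z by its sparsified similarities to the anchor embeddings."""
    sims = anchors @ z                                 # anchors: (N, d), z: (d,)
    return sparsify(sims, k, p)

def asif_classify(image_emb, caption_embs, anchor_imgs, anchor_txts, k=800, p=8):
    """Pick the candidate caption whose relative representation (w.r.t. the anchor
    captions) best matches the image's relative representation (w.r.t. the anchor
    images). k = 800 and p = 8 are the values the paper reports after tuning."""
    img_rel = relative_rep(image_emb, anchor_imgs, k, p)
    scores = [img_rel @ relative_rep(c, anchor_txts, k, p) for c in caption_embs]
    return int(np.argmax(scores))
```

In the paper's setting, the anchors are the 1.6M CC12M image-text pairs, the image embeddings come from the frozen DeiT or DINO encoder, and the caption embeddings from the frozen SentenceT encoder; no parameter of either encoder is trained.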
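The Experiment Setup row states that k and p were tuned on a subset of the ImageNet validation set. A plain grid search along the following lines would reproduce that step; the grid values, function name, and argument layout here are hypothetical (the paper only reports the selected values), and `asif_classify` is the helper from the sketch above.

```python
# Hypothetical grid search mirroring the paper's tuning of k and p on an
# ImageNet validation subset. Grid values are illustrative, not the paper's
# search space; `asif_classify` is defined in the preceding sketch.
def tune_k_p(val_imgs, val_labels, class_caption_embs, anchor_imgs, anchor_txts):
    best = (None, None, -1.0)                          # (k, p, accuracy)
    for k in (200, 400, 800, 1600):
        for p in (1, 2, 4, 8):
            preds = [asif_classify(z, class_caption_embs, anchor_imgs, anchor_txts, k, p)
                     for z in val_imgs]
            acc = sum(int(pr == y) for pr, y in zip(preds, val_labels)) / len(val_labels)
            if acc > best[2]:
                best = (k, p, acc)
    return best  # the paper reports selecting k = 800 and p = 8
```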