Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Authors: Yizhen Zhang, Minkyu Choi, Kuan Han, Zhongming Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train a language model and a vision model jointly to match images and texts. We further analyze the semantic space obtained with the visually grounded language model. In this space, semantic embeddings are found to be organized and clustered by visual attributes, predictive of human-defined norms of semantic features, useful for compositional language understanding and cross-modal image search. (An illustrative sketch of such an image-text contrastive objective follows the table.)
Researcher Affiliation | Academia | 1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109; 2 Department of Neurological Surgery, University of California San Francisco, San Francisco, CA 94143; 3 Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109; {zhyz, cminkyu, kuanhan, zmliu}@umich.edu
Pseudocode | No | The paper describes its methods in natural language and equations (Eqs. 1-7) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository for the described methodology.
Open Datasets | Yes | We train the model in three stages... on the MS COCO dataset [11]. ...after cleaning the dataset to include 114 relations and 55 object classes...based on the Visual Genome dataset [45]...The language stream is the pretrained BERT used as the baseline model for subsequent experiments. The visual stream is pretrained for object classification with ImageNet [61]. ...we use the language stream as a stand-alone model to extract the output representations of commonly used English words in the SemCat dataset (9,197 words; 100 word categories) [62]. We use the concept property norm dataset from the Centre for Speech, Language and the Brain (CSLB) [65].
Dataset Splits | Yes | The visual stream is pretrained for object classification with ImageNet [61]. Relative to the baseline CNN, the inclusion of self-attention improves the top-1 classification accuracy from 71.6% to 74.3% on the ImageNet validation dataset.
Hardware Specification | No | The paper does not provide hardware details such as GPU/CPU models, memory, or the computing infrastructure used for the experiments.
Software Dependencies | Yes | The language stream is the pretrained BERT used as the baseline model for subsequent experiments. Footnote 1: bert-base-uncased, https://huggingface.co/transformers/pretrained_models.html. (A minimal loading sketch follows the table.)
Experiment Setup | Yes | While freezing other layers, we refine the self-attention layer in the visual stream and the top k layers in BERT (by default k = 8). Training with contrastive learning is based on the MS COCO dataset... In the third stage, we further finetune the model for visual relation prediction... We refine the visual self-attention layer and the higher l layers in BERT (by default l = 2)... (A sketch of this selective layer unfreezing follows the table.)
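
The Research Type evidence describes training the language and vision streams jointly to match images and texts with cross-modal contrastive learning. As a hedged illustration only, not the authors' code, a symmetric InfoNCE-style image-text contrastive loss of the kind that description implies can be sketched as follows; the temperature value, the function name, and the assumption that both streams output fixed-size embeddings are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the visual and language
    streams. The temperature of 0.07 is an assumption, not a value reported
    in the paper.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the matched pairs
    logits = image_emb @ text_emb.t() / temperature

    # Each image should match its own caption, and vice versa
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In the paper this objective is trained on MS COCO image-caption pairs; the sketch only assumes a batch of paired embeddings of equal dimension.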
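
The Software Dependencies evidence names a single concrete dependency, the bert-base-uncased checkpoint from Hugging Face Transformers. A minimal sketch of loading that checkpoint and extracting contextual token representations, roughly what the stand-alone language stream does before visual grounding (library and Python versions are not specified in the paper):

```python
from transformers import BertModel, BertTokenizer

# The checkpoint cited in the paper's footnote
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Example caption; the sentence itself is arbitrary
inputs = tokenizer("a dog catching a frisbee in the park", return_tensors="pt")
outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # shape (1, seq_len, 768)
```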
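
The Experiment Setup evidence says the contrastive-learning stage refines only the self-attention layer of the visual stream and the top k BERT layers (k = 8 by default), with other layers frozen. One plausible way to implement the BERT side of that selective unfreezing in PyTorch is sketched below; the helper name and the choice to keep the embeddings and pooler frozen are assumptions.

```python
from transformers import BertModel

def unfreeze_top_k_bert_layers(bert: BertModel, k: int = 8) -> None:
    """Freeze all of BERT, then unfreeze only the top k encoder layers.

    Mirrors the reported setup of refining the top k layers (default k = 8)
    while the rest of the language stream stays fixed.
    """
    for param in bert.parameters():
        param.requires_grad = False
    # bert-base-uncased has 12 encoder layers; unfreeze the last k of them
    for layer in bert.encoder.layer[-k:]:
        for param in layer.parameters():
            param.requires_grad = True

bert = BertModel.from_pretrained("bert-base-uncased")
unfreeze_top_k_bert_layers(bert, k=8)

# Only the unfrozen parameters would be handed to the optimizer
trainable_params = [p for p in bert.parameters() if p.requires_grad]
```

The same pattern with l = 2 would cover the third-stage finetuning for visual relation prediction mentioned in the same row.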