Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Authors: Yizhen Zhang, Minkyu Choi, Kuan Han, Zhongming Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train a language model and a vision model jointly to match images and texts. We further analyze the semantic space obtained with the visually grounded language model. In this space, semantic embeddings are found to be organized and clustered by visual attributes, predictive of human-defined norms of semantic features, useful for compositional language understanding and cross-modal image search. (An illustrative sketch of such an image-text contrastive objective follows the table.)
Researcher Affiliation | Academia | 1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109; 2 Department of Neurological Surgery, University of California San Francisco, San Francisco, CA 94143; 3 Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109; {zhyz, cminkyu, kuanhan, zmliu}@umich.edu
Pseudocode | No | The paper describes its methods in natural language and equations (Eqs. 1-7) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository for the described methodology.
Open Datasets | Yes | We train the model in three stages... on the MS COCO dataset [11]. ...after cleaning the dataset to include 114 relations and 55 object classes...based on the Visual Genome dataset [45]...The language stream is the pretrained BERT used as the baseline model for subsequent experiments. The visual stream is pretrained for object classification with ImageNet [61]. ...we use the language stream as a stand-alone model to extract the output representations of commonly used English words in the SemCat dataset (9,197 words; 100 word categories) [62]. We use the concept property norm dataset from the Centre for Speech, Language and the Brain (CSLB) [65].
Dataset Splits | Yes | The visual stream is pretrained for object classification with ImageNet [61]. Relative to the baseline CNN, the inclusion of self-attention improves the top-1 classification accuracy from 71.6% to 74.3% on the ImageNet validation dataset.
Hardware Specification | No | The paper does not provide hardware details such as GPU/CPU models, memory, or the computing infrastructure used for the experiments.
Software Dependencies | Yes | The language stream is the pretrained BERT used as the baseline model for subsequent experiments. Footnote 1: bert-base-uncased, https://huggingface.co/transformers/pretrained_models.html. (A minimal loading sketch follows the table.)
Experiment Setup | Yes | While freezing other layers, we refine the self-attention layer in the visual stream and the top k layers in BERT (by default k = 8). Training with contrastive learning is based on the MS COCO dataset... In the third stage, we further finetune the model for visual relation prediction... We refine the visual self-attention layer and the higher l layers in BERT (by default l = 2)... (A sketch of this selective layer unfreezing follows the table.)
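
The Research Type evidence describes training the language and vision streams jointly to match images and texts with cross-modal contrastive learning. As a hedged illustration only, not the authors' code, a symmetric InfoNCE-style image-text contrastive loss of the kind that description implies can be sketched as follows; the temperature value, the function name, and the assumption that both streams output fixed-size embeddings are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from the visual and language
    streams. The temperature of 0.07 is an assumption, not a value reported
    in the paper.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the matched pairs
    logits = image_emb @ text_emb.t() / temperature

    # Each image should match its own caption, and vice versa
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In the paper this objective is trained on MS COCO image-caption pairs; the sketch only assumes a batch of paired embeddings of equal dimension.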
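
The Software Dependencies evidence names a single concrete dependency, the bert-base-uncased checkpoint from Hugging Face Transformers. A minimal sketch of loading that checkpoint and extracting contextual token representations, roughly what the stand-alone language stream does before visual grounding (library and Python versions are not specified in the paper):

```python
from transformers import BertModel, BertTokenizer

# The checkpoint cited in the paper's footnote
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Example caption; the sentence itself is arbitrary
inputs = tokenizer("a dog catching a frisbee in the park", return_tensors="pt")
outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # shape (1, seq_len, 768)
```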
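
The Experiment Setup evidence says the contrastive-learning stage refines only the self-attention layer of the visual stream and the top k BERT layers (k = 8 by default), with other layers frozen. One plausible way to implement the BERT side of that selective unfreezing in PyTorch is sketched below; the helper name and the choice to keep the embeddings and pooler frozen are assumptions.

```python
from transformers import BertModel

def unfreeze_top_k_bert_layers(bert: BertModel, k: int = 8) -> None:
    """Freeze all of BERT, then unfreeze only the top k encoder layers.

    Mirrors the reported setup of refining the top k layers (default k = 8)
    while the rest of the language stream stays fixed.
    """
    for param in bert.parameters():
        param.requires_grad = False
    # bert-base-uncased has 12 encoder layers; unfreeze the last k of them
    for layer in bert.encoder.layer[-k:]:
        for param in layer.parameters():
            param.requires_grad = True

bert = BertModel.from_pretrained("bert-base-uncased")
unfreeze_top_k_bert_layers(bert, k=8)

# Only the unfrozen parameters would be handed to the optimizer
trainable_params = [p for p in bert.parameters() if p.requires_grad]
```

The same pattern with l = 2 would cover the third-stage finetuning for visual relation prediction mentioned in the same row.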