Connecting Multi-modal Contrastive Representations

Authors: Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Aoxiong Yin, Li Tang, Linjun Li, Yongqi Wang, Ziang Zhang, Zhou Zhao

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of C-MCR, we take the field of audio-visual and 3D-language learning as examples. Specifically, we connect CLIP and CLAP via texts to derive audio-visual representations, and integrate CLIP and ULIP via images for 3D-language representations. Remarkably, without using any paired data, C-MCR for audio-visual achieves state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks. Furthermore, C-MCR for 3D-language also attains advanced zero-shot 3D point cloud classification accuracy on ModelNet40. Our project page is available at https://c-mcr.github.io/C-MCR/ (a hedged sketch of this text-bridged connection follows the table)
Researcher Affiliation | Collaboration | 1Zhejiang University, 2ByteDance, 3Shanghai AI Laboratory; wangzehan01@zju.edu.cn
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The only relevant statement is "Our project page is available at https://c-mcr.github.io/C-MCR/", which points to a project page rather than an explicit open-source code release or a direct link to a code repository.
Open Datasets | Yes | Text Datasets. We collected texts from three sources: image-text datasets (COCO [58] and CC3M [59]), video-text datasets (MSRVTT [60], MAD [61]), and audio-text datasets (AudioCaps [62], Clotho [63]), to ensure that the texts contain sufficient visual, action, and audio information... Audio/Image Memory. AudioSet [21] provides a vast collection of audio snippets from YouTube videos... ImageNet1K [64] is a large-scale image recognition dataset... Image Datasets. The image dataset used for connecting ULIP and CLIP is ImageNet1K [64]... (a hedged memory-retrieval sketch follows the table)
Dataset Splits | Yes | Due to the small size of the test sets in both datasets, we utilized all available data in the train, eval, and test sets for evaluation, resulting in 4095 samples for AVE and 5000 samples for Flickr-SoundNet.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using pre-trained models (CLIP, CLAP, ULIP-2) and optimizers (AdamW), but does not provide specific version numbers for any software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | Implementation Details. We employ a frozen pre-trained CLIP ViT-B/32 model [1] and CLAP model [13]. We adopt simple multi-layer perceptrons as our projectors f1(·) and f2(·). The τ1, τ2 and τ3 in Equations 2, 5 and 6 are all set to 1/100. The variance σ² of the noises in Equation 3 is set to 0.004. The hyper-parameter λ in Equation 10 is set to 0.1. We train our projectors for 36 epochs using a batch size of 10240. We use the AdamW optimizer with an initial learning rate of 1e-3 and a cosine learning rate decay strategy. (a hedged sketch of this training setup follows the table)
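
The Research Type row quotes the core idea: CLIP and CLAP are connected through the text modality they share. Below is a minimal sketch of that idea, not the authors' released code, assuming precomputed text embeddings from both frozen encoders: two small MLP projectors map each space into a new shared space, and a symmetric InfoNCE loss pulls the two projections of the same text together. The names Projector and info_nce, the hidden width, and the 512-dimensional feature size are illustrative assumptions.

```python
# Hedged sketch of connecting two frozen contrastive spaces (e.g. CLIP and CLAP)
# through their shared text modality. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projector(nn.Module):
    """Simple MLP mapping a frozen embedding into the new shared space (hidden width is assumed)."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)


def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 1 / 100) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched, L2-normalized embedding pairs."""
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy stand-ins for embeddings of the *same* texts from the two frozen encoders.
clip_text = F.normalize(torch.randn(32, 512), dim=-1)  # CLIP text features
clap_text = F.normalize(torch.randn(32, 512), dim=-1)  # CLAP text features

f1, f2 = Projector(512, 512), Projector(512, 512)
loss = info_nce(f1(clip_text), f2(clap_text))
loss.backward()
```

The same recipe would apply to connecting CLIP and ULIP via images, with image embeddings taking the place of the shared texts.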
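The Open Datasets row mentions audio and image "memories" built from AudioSet and ImageNet1K, but the quoted text does not spell out how those memories are used. The sketch below is purely an assumed reading: each text embedding retrieves a softmax-weighted aggregate of the most similar memory embeddings in the other modality, sharpened by a small temperature such as the 1/100 reported in the setup row. The function name retrieve_from_memory and the bank sizes are hypothetical.

```python
# Hedged sketch (an assumption, not the paper's code): retrieve a semantically
# related "memory" embedding for each text embedding by softmax-weighted
# aggregation over a bank of frozen audio (or image) embeddings.
import torch
import torch.nn.functional as F


def retrieve_from_memory(text_emb: torch.Tensor,
                         memory: torch.Tensor,
                         tau: float = 1 / 100) -> torch.Tensor:
    """text_emb: (B, D) L2-normalized text features; memory: (M, D) L2-normalized
    audio/image features. Returns one aggregated memory embedding per text."""
    sims = text_emb @ memory.t()                  # (B, M) cosine similarities
    weights = F.softmax(sims / tau, dim=-1)       # small tau sharpens the weighting
    return F.normalize(weights @ memory, dim=-1)  # (B, D) weighted aggregate


# Toy usage with random stand-ins for CLAP text features and an AudioSet memory bank.
texts = F.normalize(torch.randn(8, 512), dim=-1)
audio_memory = F.normalize(torch.randn(1000, 512), dim=-1)
pseudo_audio = retrieve_from_memory(texts, audio_memory)
print(pseudo_audio.shape)  # torch.Size([8, 512])
```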
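The Experiment Setup row gives concrete hyperparameters: MLP projectors, temperatures of 1/100, noise variance 0.004, λ = 0.1, 36 epochs, batch size 10240, and AdamW with an initial learning rate of 1e-3 under cosine decay. The sketch below wires those numbers together in PyTorch, reusing Projector and info_nce from the first sketch; the random embedding banks, the placement of the Gaussian noise on the embeddings, and the zero-valued placeholder for the second (λ-weighted) loss term are assumptions, since the quoted text does not define Equations 3 and 10.

```python
# Hedged training-setup sketch using the reported hyperparameters; it reuses the
# Projector and info_nce definitions from the first sketch. The data and the
# second loss term are placeholders, not the authors' implementation.
import math
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

EPOCHS, BATCH_SIZE, SIGMA2, LAMBDA = 36, 10240, 0.004, 0.1  # per the paper's setup

# Placeholder bank of precomputed (CLIP text, CLAP text) embedding pairs.
clip_bank = F.normalize(torch.randn(50_000, 512), dim=-1)
clap_bank = F.normalize(torch.randn(50_000, 512), dim=-1)
loader = DataLoader(TensorDataset(clip_bank, clap_bank), batch_size=BATCH_SIZE, shuffle=True)

f1, f2 = Projector(512, 512), Projector(512, 512)
optimizer = torch.optim.AdamW(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    for clip_emb, clap_emb in loader:
        # Zero-mean Gaussian noise with variance sigma^2 = 0.004, added here to the
        # frozen embeddings (an assumption; the quoted text only gives the variance).
        clip_emb = clip_emb + math.sqrt(SIGMA2) * torch.randn_like(clip_emb)
        clap_emb = clap_emb + math.sqrt(SIGMA2) * torch.randn_like(clap_emb)

        inter_loss = info_nce(f1(clip_emb), f2(clap_emb), tau=1 / 100)
        second_term = torch.tensor(0.0)           # placeholder: Equation 10's other term is not quoted
        loss = inter_loss + LAMBDA * second_term  # lambda = 0.1 weighting

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```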