Extending Multi-modal Contrastive Representations

Authors: Ziang Zhang, Zehan Wang, Luping Liu, Rongjie Huang, Xize Cheng, Zhenhui Ye, Wang Lin, Huadai Liu, Haifeng Huang, Yang Zhao, Tao Jin, Siqi Zheng, Zhou Zhao

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Each reproducibility variable is listed below with its assessed result and the LLM's supporting response.
Research Type: Experimental. Experimentally, we extend pre-trained audio-text and 3D-image representations to the existing image-text space. Without using paired data, Ex-MCR achieves comparable performance to advanced methods on a series of audio-image-text and 3D-image-text tasks, and achieves superior performance when used in parallel with data-driven methods.
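
To make the extension idea concrete, here is a minimal, hypothetical sketch of the kind of learnable projector such methods train to map a source space (e.g., CLAP's audio-text space) into a frozen target space (e.g., CLIP's image-text space). The module name, layer widths, and 512-d embedding sizes are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCRProjector(nn.Module):
    """Hypothetical projector mapping embeddings from a source MCR space
    (e.g., CLAP) into a frozen target MCR space (e.g., CLIP). Both base
    encoders stay frozen; only this small module would be trained."""

    def __init__(self, src_dim: int = 512, tgt_dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, tgt_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so projected embeddings lie on the same unit
        # hypersphere as the target space's embeddings.
        return F.normalize(self.net(x), dim=-1)

# Usage: project a batch of precomputed (frozen) source embeddings.
clap_emb = F.normalize(torch.randn(8, 512), dim=-1)  # stand-in features
clip_aligned = MCRProjector()(clap_emb)              # now comparable to CLIP

```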
Researcher Affiliation: Collaboration. Ziang Zhang (1,2), Zehan Wang (1,2), Luping Liu (1), Rongjie Huang (1), Xize Cheng (1), Zhenhui Ye (1), Wang Lin (1), Huadai Liu (1), Haifeng Huang (1), Yang Zhao (1), Tao Jin (1), Siqi Zheng (3), Zhou Zhao (1,2). Affiliations: 1 Zhejiang University; 2 Shanghai AI Laboratory; 3 Alibaba Group.
Pseudocode: No. The paper describes the steps of its proposed method in detail in the text and figures, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code: Yes. Our project page is available at https://github.com/MCR-PEFT/Ex-MCR.
Open Datasets: Yes. For a fair comparison, we use the same unimodal datasets as C-MCR [19] for training, totaling 2.31M texts, 1.3M images, 1.8M audio clips, and 0.8M 3D point clouds. ... Text Dataset: ... in image-text datasets (COCO, CC3M), video-text datasets (MSRVTT, MAD), and audio-text datasets (AudioCaps, Clotho). ... Image Dataset: ... ImageNet1K ... Audio Dataset: AudioSet ... 3D Point Cloud Dataset: For the 3D modality, we use Objaverse...
Dataset Splits: No. The paper describes using various unimodal datasets for training (e.g., CC3M, ImageNet1K, AudioSet, Objaverse) but does not specify traditional train/validation/test splits within its own training process, as the method is paired-data-free. Validation sets are mentioned only for evaluating performance on downstream tasks, not for training the Ex-MCR model itself.
Hardware Specification: Yes. Collecting a group of pseudo datasets takes about 10 hours on a single 4090 while using 12GB of GPU memory. Training the projectors between two spaces takes approximately 1.5 hours on a single 4090 and requires only 3GB of GPU memory.
Software Dependencies: No. The paper mentions the specific pre-trained models used (e.g., CLIP ViT-B/32, CLAP, ULIPv2, OpenCLIP ViT-H) but does not list general software dependencies with specific version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup: Yes. The temperature τ1 in Eq. 12 for embedding aggregation is set to 0.01 following [19], while τ2 in Eq. 6 is set to 0.05. The hyper-parameter λ in Eq. 7 is set to 0.1. ... We train our model with a batch size of 4096 for 36 epochs. We employ the AdamW optimizer with an initial learning rate of 1e-3 and a cosine learning rate decay strategy.
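
The reported hyper-parameters drop naturally into a standard contrastive training setup. The following sketch only fixes the values quoted above (τ1 = 0.01, τ2 = 0.05, λ = 0.1, batch size 4096, 36 epochs, AdamW at 1e-3 with cosine decay); the aggregation and loss functions are plausible readings of Eqs. 12 and 6, not the paper's verbatim definitions.

```python
import torch
import torch.nn.functional as F

TAU1, TAU2, LAM = 0.01, 0.05, 0.1   # temperatures and loss weight from the paper
BATCH, EPOCHS, LR = 4096, 36, 1e-3  # reported training configuration

def aggregate(query: torch.Tensor, memory: torch.Tensor, tau: float = TAU1):
    """Softmax-weighted aggregation of memory embeddings (a plausible
    reading of Eq. 12): each query is paired with a similarity-weighted
    average of embeddings from the other space."""
    weights = F.softmax(query @ memory.T / tau, dim=-1)   # (B, M)
    return F.normalize(weights @ memory, dim=-1)          # (B, D)

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = TAU2):
    """Symmetric InfoNCE over a batch of aligned pairs (a plausible
    reading of Eq. 6)."""
    logits = a @ b.T / tau
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target)
                  + F.cross_entropy(logits.T, target))

# Stand-in for the trained projector (see the earlier sketch).
projector = torch.nn.Linear(512, 512)
opt = torch.optim.AdamW(projector.parameters(), lr=LR)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)

# Per Eq. 7 the total loss weights a secondary term by λ; its exact form
# is not quoted above, so this sketch writes it generically:
#   loss = info_nce(proj_src, tgt) + LAM * auxiliary_term
```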