Extending Multi-modal Contrastive Representations
Authors: Ziang Zhang, Zehan Wang, Luping Liu, Rongjie Huang, Xize Cheng, Zhenhui Ye, Wang Lin, Huadai Liu, Haifeng Huang, Yang Zhao, Tao Jin, Siqi Zheng, Zhou Zhao
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we extend pre-trained audio-text and 3D-image representations to the existing image-text space. Without using paired data, Ex-MCR achieves comparable performance to advanced methods on a series of audio-image-text and 3D-image-text tasks and achieves superior performance when used in parallel with data-driven methods. |
| Researcher Affiliation | Collaboration | Ziang Zhang¹,², Zehan Wang¹,², Luping Liu¹, Rongjie Huang¹, Xize Cheng¹, Zhenhui Ye¹, Wang Lin¹, Huadai Liu¹, Haifeng Huang¹, Yang Zhao¹, Tao Jin¹, Siqi Zheng³, Zhou Zhao¹,² (¹Zhejiang University, ²Shanghai AI Laboratory, ³Alibaba Group) |
| Pseudocode | No | The paper describes the steps of its proposed method in detail within the text and using figures, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our project page is available at https://github.com/MCR-PEFT/Ex-MCR. |
| Open Datasets | Yes | For a fair comparison, we use the same unimodal datasets as C-MCR [19] for training, totaling 2.31M texts, 1.3M images, 1.8M audio, and 0.8M 3D point clouds. ... Text Dataset ... in image-text datasets (COCO, CC3M), video-text datasets (MSRVTT, MAD), and audio-text datasets (AudioCaps, Clotho). ... Image Dataset ... ImageNet1K ... Audio Dataset AudioSet ... 3D Point Cloud Dataset For the 3D modality, we use Objaverse... |
| Dataset Splits | No | The paper describes using various unimodal datasets for training (e.g., CC3M, ImageNet1K, AudioSet, Objaverse) but does not specify traditional train/validation/test splits for these datasets within its own training process, as it is a paired-data-free method. Validation sets are mentioned for evaluating performance on downstream tasks, not for the training of the Ex-MCR model itself. |
| Hardware Specification | Yes | Collecting a group of pseudo datasets takes about 10 hours on a single 4090 while using 12GB of GPU memory. The training times for projectors between two spaces are approximately 1.5 hours on a single 4090, and it only requires 3GB of GPU memory. |
| Software Dependencies | No | The paper mentions specific pre-trained models used (e.g., CLIP ViT-B/32, CLAP, ULIPv2, OpenCLIP ViT-H) but does not list general software dependencies with specific version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The temperature τ1 in Eq. 12 for embedding aggregation is set to 0.01 following [19], while the τ2 in Eq. 6 is set to 0.05. The hyper-parameter λ in Eq. 7 is set to 0.1. ... We train our model with a batch size of 4096 for 36 epochs. We employ the AdamW optimizer with an initial learning rate of 1e-3 and a cosine learning rate decay strategy. |
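For readers reproducing the reported setup, below is a minimal PyTorch sketch. Only the hyperparameter values come from the paper (τ1 = 0.01, τ2 = 0.05, λ = 0.1, batch size 4096, 36 epochs, AdamW at 1e-3 with cosine decay); the projector architecture, the `aggregate` helper's exact form, and the data loader are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameter values reported in the paper's experiment setup.
TAU1 = 0.01       # temperature for embedding aggregation (Eq. 12, following [19])
TAU2 = 0.05       # temperature for the contrastive loss (Eq. 6)
LAMBDA = 0.1      # loss-weighting hyper-parameter (Eq. 7)
LR = 1e-3         # initial learning rate
EPOCHS = 36
BATCH_SIZE = 4096

def aggregate(query: torch.Tensor, memory: torch.Tensor, tau: float = TAU1) -> torch.Tensor:
    """Generic softmax-weighted aggregation of memory embeddings.

    This shows only the role of the temperature τ1; the paper's exact
    Eq. 12 formulation (inherited from C-MCR [19]) may differ in detail.
    """
    weights = F.softmax(query @ memory.t() / tau, dim=-1)
    return weights @ memory

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = TAU2) -> torch.Tensor:
    """Symmetric temperature-scaled contrastive loss over paired embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau  # cosine similarities scaled by temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Placeholder projector: the paper trains lightweight projectors between MCR
# spaces, but does not restate the architecture here, so this MLP is assumed.
projector = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))

optimizer = torch.optim.AdamW(projector.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

# Hypothetical training loop; `loader` would yield pseudo-paired embeddings.
# for epoch in range(EPOCHS):
#     for src_emb, tgt_emb in loader:
#         loss = info_nce(projector(src_emb), tgt_emb)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```

Note the small GPU footprint reported above (3GB for projector training) is consistent with a setup like this, where only the projector's parameters receive gradients and the frozen encoders are used offline to produce embeddings.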