Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Extending Multi-modal Contrastive Representations
Authors: Ziang Zhang, Zehan Wang, Luping Liu, Rongjie Huang, Xize Cheng, Zhenhui Ye, Wang Lin, Huadai Liu, Haifeng Huang, Yang Zhao, Tao Jin, Siqi Zheng, Zhou Zhao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we extend pre-trained audio-text and 3D-image representations to the existing image-text space. Without using paired data, Ex-MCR achieves comparable performance to advanced methods on a series of audio-image-text and 3D-image-text tasks and achieves superior performance when used in parallel with data-driven methods. |
| Researcher Affiliation | Collaboration | Ziang Zhang (1, 2), Zehan Wang (1, 2), Luping Liu (1), Rongjie Huang (1), Xize Cheng (1), Zhenhui Ye (1), Wang Lin (1), Huadai Liu (1), Haifeng Huang (1), Yang Zhao (1), Tao Jin (1), Siqi Zheng (3), Zhou Zhao (1, 2). Affiliations: (1) Zhejiang University; (2) Shanghai AI Laboratory; (3) Alibaba Group. |
| Pseudocode | No | The paper describes the steps of its proposed method in detail within the text and using figures, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our project page is available at https://github.com/MCR-PEFT/Ex-MCR. |
| Open Datasets | Yes | For a fair comparison, we use the same unimodal datasets to C-MCR [19] for training, totaling 2.31M texts, 1.3M images, 1.8M audio, and 0.8M 3D point clouds. ... Text Dataset ... in image-text datasets (COCO, CC3M), video-text datasets (MSRVTT, MAD), and audio-text datasets (AudioCaps, Clotho). ... Image Dataset ... ImageNet1K ... Audio Dataset AudioSet ... 3D Point Cloud Dataset For the 3D modality, we use Objaverse... |
| Dataset Splits | No | The paper describes using various unimodal datasets for training (e.g., CC3M, ImageNet1K, AudioSet, Objaverse) but does not specify traditional train/validation/test splits for these datasets within its own training process, as it is a paired-data-free method. Validation sets are mentioned for evaluating performance on downstream tasks, not for the training of the Ex-MCR model itself. |
| Hardware Specification | Yes | Collecting a group of pseudo datasets takes about 10 hours on a single 4090 while using 12GB GPU memory. The training times for projectors between two spaces are approximately 1.5 hours, on a single 4090, and it only requires 3GB of GPU memory. |
| Software Dependencies | No | The paper mentions specific pre-trained models used (e.g., CLIP ViT-B/32, CLAP, ULIPv2, OpenCLIP ViT-H) but does not list general software dependencies with specific version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The temperature τ1 in Eq. 12 for embedding aggregation is set to 0.01 following [19], while the τ2 in Eq. 6 is set to 0.05. The hyper-parameter λ in Eq. 7 is set to 0.1. ... We train our model with a batch size of 4096 for 36 epochs. We employ the AdamW optimizer with an initial learning rate of 1e-3 and a cosine learning rate decay strategy. |
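
The Experiment Setup row quotes a batch size of 4096, 36 epochs, an initial learning rate of 1e-3, and cosine learning rate decay. The sketch below shows one common way such a schedule could be computed; it is an illustration, not the authors' implementation. The function names, the zero minimum learning rate, the absence of warmup, and the per-step (rather than per-epoch) update are all assumptions, since the quoted text does not specify them.

```python
import math

# Values quoted in the Experiment Setup row of the table.
BATCH_SIZE = 4096
EPOCHS = 36
INITIAL_LR = 1e-3


def steps_per_epoch(num_samples: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of optimizer steps per epoch (hypothetical helper).

    num_samples is a placeholder: the quoted text does not state how many
    training samples each projector sees per epoch.
    """
    return math.ceil(num_samples / batch_size)


def cosine_lr(step: int, total_steps: int,
              base_lr: float = INITIAL_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps.

    Assumptions: no warmup and min_lr = 0.0; neither is given in the source.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical sample count, used only to make the example concrete.
    total = EPOCHS * steps_per_epoch(1_000_000)
    for s in (0, total // 2, total):
        print(s, cosine_lr(s, total))
```

With these assumptions the learning rate starts at 1e-3, passes through roughly half that value midway through training, and approaches zero at the final step. A framework scheduler (for example a cosine annealing scheduler paired with AdamW) would normally replace the hand-written function in practice.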