Achieving Cross Modal Generalization with Multimodal Unified Representation

Authors: Yan Xia, Hai Huang, Jieming Zhu, Zhou Zhao

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on various downstream tasks, i.e., cross-modal event classification, localization, cross-modal retrieval, query-based video segmentation, and cross-dataset event localization, demonstrate the effectiveness of our proposed methods."
Researcher Affiliation | Collaboration | Zhejiang University, Shanghai Artificial Intelligence Laboratory, Huawei Noah's Ark Lab
Pseudocode | No | The paper contains a network overview figure and mathematical equations but no structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code is available at https://github.com/haihuangcode/CMG."
Open Datasets | Yes | "We use VGGsound-AVEL [44, 45] to pre-train our unified representation, and divide it into several different sizes: 24K, 40K, 81K." Downstream tasks use AVE [46] (cross-modal event classification), AVVP [47] (cross-modal event localization), and AVSBench-S4 [48] (cross-modal video segmentation).
Dataset Splits | No | The paper mentions using VGGsound-AVEL (24K, 40K, 81K) for pre-training and AVE, AVVP, and AVSBench-S4 for downstream tasks, but does not explicitly state the train/validation/test splits (percentages or counts) for any of these datasets, nor does it cite a source for predefined splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or the computing environment used to run the experiments.
Software Dependencies | No | The paper mentions using MindSpore but does not provide version numbers for any software dependencies.
Experiment Setup | Yes | "where β is 0.25 for all our experiments", "(50% in our setting, β is the same as in Eq 3)", and "The implementation details are provided in Appendix."
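
The quoted β = 0.25 matches the standard commitment coefficient used when training a discrete codebook, and the paper's unified representation is codebook-based. The sketch below is only an illustration of what such a β-weighted commitment term typically looks like; it assumes Eq 3 is a VQ-VAE-style quantization loss, which this report cannot verify, and the function and variable names (`vq_commitment_loss`, `codebook`, `z_e`) are hypothetical rather than taken from the CMG repository. PyTorch is used purely for illustration.

```python
# Illustrative sketch only: assumes beta plays the role of a VQ-VAE-style
# commitment coefficient (beta = 0.25, as quoted above). All names here are
# hypothetical and not taken from the CMG codebase.
import torch
import torch.nn.functional as F

def vq_commitment_loss(z_e: torch.Tensor, codebook: torch.Tensor, beta: float = 0.25):
    """Quantize encoder outputs z_e (B, D) against a codebook (K, D) and
    return the quantized vectors plus the codebook/commitment loss."""
    # Nearest codebook entry for each encoder output (L2 distance).
    distances = torch.cdist(z_e, codebook)   # (B, K)
    indices = distances.argmin(dim=1)        # (B,)
    z_q = codebook[indices]                  # (B, D)

    # Codebook term pulls code vectors toward the (frozen) encoder outputs;
    # the commitment term, weighted by beta, keeps the encoder close to its
    # assigned codes.
    codebook_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())

    # Straight-through estimator so gradients flow back to the encoder.
    z_q = z_e + (z_q - z_e).detach()
    return z_q, codebook_loss + commitment_loss
```
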