Facilitating Multimodal Classification via Dynamically Learning Modality Gap

Authors: Yang Yang, Fengqiang Wan, Qing-Yuan Jiang, Yi Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) multimodal learning approaches.
Researcher Affiliation | Academia | Nanjing University of Science and Technology; Dalian University of Technology. {yyang,fqwan,jiangqy}@njust.edu.cn, yxu@dlut.edu.cn
Pseudocode | Yes | Algorithm 1: The Proposed Algorithm.
Open Source Code | Yes | The code is available at https://github.com/njustkmg/NeurIPS24-LFM.
Open Datasets | Yes | We select six widely used datasets, including Kinetics Sounds [2], CREMA-D [5], Sarcasm [4], Twitter2015 [49], NVGesture [42], and VGGSound [7], to validate our proposed method. ... all datasets we used in this paper are available online based on their corresponding paper.
Dataset Splits | Yes | The Kinetics Sounds dataset, which contains 19,000 video clips categorized into 31 distinct actions, aims to advance video action recognition. It is divided into a training set of 15,000 clips, a validation set of 1,900 clips, and a test set of 1,900 clips. ... The Sarcasm dataset offers a compilation of 24,635 text-image pairs, divided into 19,816 for the training set, 2,410 for the validation set, and 2,409 for the test set. The Twitter2015 dataset contains 5,338 text-image combinations from Twitter, with 3,179 in the training set, 1,122 in the validation set, and 1,037 in the test set.
Hardware Specification | Yes | All models are trained on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions using 'librosa [29]' for audio conversion, and 'ResNet18', 'ResNet50', and 'BERT [9]' as architectures/models. However, it does not specify version numbers for any software libraries or dependencies, which is required for a reproducible description.
Experiment Setup | Yes | Optimization for the audio-video datasets is conducted using stochastic gradient descent (SGD) with a momentum set to 0.9 and a weight decay parameter of 10⁻¹. We initialize the learning rate to 10⁻², progressively reducing it by a factor of ten upon observing a plateau in loss reduction, with a batch size of 256. For text-image datasets [4, 49], we employ the Adam optimizer starting with a learning rate of 10⁻⁴, with a batch size of 128.
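The plateau-based decay described in the experiment setup (initial LR 10⁻², reduced by a factor of ten when the loss stops improving) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name and the `patience` threshold are assumptions, and in practice such a schedule is typically handled by a library utility like PyTorch's ReduceLROnPlateau.

```python
def plateau_lr_schedule(losses, init_lr=1e-2, factor=0.1, patience=2):
    """Return the per-epoch learning rate under a simple plateau rule:
    if the loss has not improved for `patience` consecutive epochs,
    multiply the learning rate by `factor` (here, drop it tenfold)."""
    lr = init_lr
    best = float("inf")
    stale = 0  # epochs since the last improvement
    lrs = []
    for loss in losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # plateau detected: reduce LR by a factor of ten
                stale = 0
        lrs.append(lr)
    return lrs

# Example: the loss stalls at 0.8 for two epochs, triggering one LR drop.
lrs = plateau_lr_schedule([1.0, 0.8, 0.8, 0.8, 0.5])
```

Here `lrs` starts at 10⁻² and ends at 10⁻³ after the single plateau; a real run would apply the same rule over many more epochs.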