Facilitating Multimodal Classification via Dynamically Learning Modality Gap
Authors: Yang Yang, Fengqiang Wan, Qing-Yuan Jiang, Yi Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) multimodal learning approaches. |
| Researcher Affiliation | Academia | Nanjing University of Science and Technology; Dalian University of Technology. {yyang,fqwan,jiangqy}@njust.edu.cn, yxu@dlut.edu.cn |
| Pseudocode | Yes | Algorithm 1: The Proposed Algorithm. |
| Open Source Code | Yes | The code is available at https://github.com/njustkmg/NeurIPS24-LFM. |
| Open Datasets | Yes | We select six widely used datasets, including Kinetics Sounds [2], CREMA-D [5], Sarcasm [4], Twitter2015 [49], NVGesture [42], and VGGSound [7] datasets, to validate our proposed method. ... all datasets we used in this paper are available online based on their corresponding paper. |
| Dataset Splits | Yes | The Kinetics Sounds dataset, which contains 19,000 video clips categorized into 31 distinct actions, aims to advance video action recognition. It is divided into a training set of 15,000 clips, a validation set of 1,900 clips, and a test set of 1,900 clips. ... The Sarcasm dataset offers a compilation of 24,635 text-image pairs, divided into 19,816 for the training set, 2,410 for the validation set, and 2,409 for the test set. The Twitter2015 dataset contains 5,338 text-image combinations from Twitter, with 3,179 in the training set, 1,122 in the validation set, and 1,037 in the test set. |
| Hardware Specification | Yes | All models are trained on a single RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using 'librosa [29]' for audio conversion, and 'ResNet18', 'ResNet50', and 'BERT [9]' as architectures/models. However, it does not specify version numbers for any software libraries or dependencies, which is required for a reproducible description. |
| Experiment Setup | Yes | Optimization for the audio-video datasets is conducted using stochastic gradient descent (SGD) with a momentum set to 0.9 and a weight decay parameter of 10⁻¹. We initialize the learning rate to 10⁻², progressively reducing it by a factor of ten upon observing a plateau in loss reduction, with a batch size of 256. For text-image datasets [4, 49], we employ the Adam optimizer starting with a learning rate of 10⁻⁴, with a batch size of 128. |
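The quoted setup reduces the learning rate by a factor of ten when the loss plateaus. A minimal dependency-free sketch of that schedule is shown below; the `patience` value and function name are illustrative assumptions, as the paper excerpt does not state how a plateau is detected.

```python
def plateau_schedule(losses, lr=1e-2, factor=0.1, patience=3):
    """Return the per-epoch learning rates implied by a loss history.

    Sketch of the paper's described policy: multiply the learning rate
    by `factor` (here 0.1, i.e. "a factor of ten") whenever the loss has
    failed to improve for `patience` consecutive epochs. The patience of
    3 is an assumed value, not taken from the paper.
    """
    lrs, best, stale = [], float("inf"), 0
    for loss in losses:
        if loss < best:
            best, stale = loss, 0  # loss improved: reset the counter
        else:
            stale += 1
            if stale >= patience:  # plateau detected
                lr *= factor
                stale = 0
        lrs.append(lr)
    return lrs
```

For example, with a loss history that stops improving after the second epoch, `plateau_schedule([1.0, 0.8, 0.8, 0.8, 0.8])` keeps the initial rate of 0.01 for four epochs and then drops to 0.001. In a PyTorch training loop, `torch.optim.lr_scheduler.ReduceLROnPlateau` implements the same idea.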