Facilitating Multimodal Classification via Dynamically Learning Modality Gap

Authors: Yang Yang, Fengqiang Wan, Qing-Yuan Jiang, Yi Xu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) multimodal learning approaches.
Researcher Affiliation | Academia | Nanjing University of Science and Technology; Dalian University of Technology. {yyang,fqwan,jiangqy}@njust.edu.cn, yxu@dlut.edu.cn
Pseudocode | Yes | Algorithm 1: The Proposed Algorithm.
Open Source Code | Yes | The code is available at https://github.com/njustkmg/NeurIPS24-LFM.
Open Datasets | Yes | We select six widely used datasets, including Kinetics Sounds [2], CREMA-D [5], Sarcasm [4], Twitter2015 [49], NVGesture [42], and VGGSound [7], to validate our proposed method. ... all datasets we used in this paper are available online based on their corresponding paper.
Dataset Splits | Yes | The Kinetics Sounds dataset, which contains 19,000 video clips categorized into 31 distinct actions, aims to advance video action recognition. It is divided into a training set of 15,000 clips, a validation set of 1,900 clips, and a test set of 1,900 clips. ... The Sarcasm dataset offers a compilation of 24,635 text-image pairs, divided into 19,816 for the training set, 2,410 for the validation set, and 2,409 for the test set. The Twitter2015 dataset contains 5,338 text-image combinations from Twitter, with 3,179 in the training set, 1,122 in the validation set, and 1,037 in the test set.
Hardware Specification | Yes | All models are trained on a single RTX 3090 GPU.
Software Dependencies | No | The paper mentions using 'librosa [29]' for audio conversion, and 'ResNet18', 'ResNet50', and 'BERT [9]' as architectures/models. However, it does not specify version numbers for any software libraries or dependencies, which is required for a reproducible description.
Experiment Setup | Yes | Optimization for the audio-video datasets is conducted using stochastic gradient descent (SGD) with a momentum set to 0.9 and a weight decay parameter of 10⁻¹. We initialize the learning rate to 10⁻², progressively reducing it by a factor of ten upon observing a plateau in loss reduction, with a batch size of 256. For text-image datasets [4, 49], we employ the Adam optimizer starting with a learning rate of 10⁻⁴, with a batch size of 128.
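The plateau-based decay described in the experiment setup (initial LR 10⁻², reduced by a factor of ten when the loss stops improving) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name and the `patience` threshold are assumptions, and in practice such a schedule is typically handled by a library utility like PyTorch's ReduceLROnPlateau.

```python
def plateau_lr_schedule(losses, init_lr=1e-2, factor=0.1, patience=2):
    """Return the per-epoch learning rate under a simple plateau rule:
    if the loss has not improved for `patience` consecutive epochs,
    multiply the learning rate by `factor` (here, drop it tenfold)."""
    lr = init_lr
    best = float("inf")
    stale = 0  # epochs since the last improvement
    lrs = []
    for loss in losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # plateau detected: reduce LR by a factor of ten
                stale = 0
        lrs.append(lr)
    return lrs

# Example: the loss stalls at 0.8 for two epochs, triggering one LR drop.
lrs = plateau_lr_schedule([1.0, 0.8, 0.8, 0.8, 0.5])
```

Here `lrs` starts at 10⁻² and ends at 10⁻³ after the single plateau; a real run would apply the same rule over many more epochs.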