Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning

Authors: Divyam Madaan, Taro Makino, Sumit Chopra, Kyunghyun Cho

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach using real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods focusing only on one type of modality dependency.
Researcher Affiliation | Collaboration | Divyam Madaan (1), Taro Makino (2), Sumit Chopra (1, 2, 3), Kyunghyun Cho (1, 2, 4, 5); (1) Courant Institute of Mathematical Sciences, New York University; (2) Center for Data Science, New York University; (3) Grossman School of Medicine, New York University; (4) Prescient Design, Genentech; (5) CIFAR LMB
Pseudocode | No | The paper describes its models and methods using equations and natural language, but no structured pseudocode or algorithm blocks are provided. (An illustrative sketch follows the table.)
Open Source Code | Yes | The code is available at https://github.com/divyam3897/I2M2.
Open Datasets | Yes | AV-MNIST combines audio and visual modalities for the MNIST digit (0-9) recognition task. We use 55,000 examples for training, 5,000 for validation, and 10,000 for testing. The fastMRI dataset [81] was the first large-scale dataset that consisted of raw k-space data... The Medical Information Mart for Intensive Care (MIMIC-III) dataset is a popular medical benchmark... The NLVR2 dataset [70] incorporated real-world photographs. VQA-VS [65] consolidated the training and validation sets from the VQA v2 dataset.
Dataset Splits | Yes | We use 55,000 examples for training, 5,000 for validation, and 10,000 for testing (AV-MNIST). Following Liang et al. [41], we split the dataset into 80% for training, 10% for validation, and 10% for testing, resulting in 28,970 training, 3,621 validation, and 3,621 testing examples (MIMIC-III). (A split sketch follows the table.)
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions various models and optimizers used (e.g., LeNet, Preact ResNet-18, the FIBER model with Swin Transformer and RoBERTa, SGD, RMSProp) but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For training all the models, we optimize the cross-entropy loss with SGD, using a learning rate of 5e-2 and weight decay of 1e-4 for 25 epochs (AV-MNIST). We use a batch size of 40 and RMSProp with a learning rate of 1e-3 to train for twenty epochs across all the tasks (MIMIC-III). We fine-tune an MLP classifier on top of the encoder with a learning rate of 1e-4 for VQA-VS; for NLVR2, we fine-tune the full model with a learning rate of 1e-5 over five seeds (NLVR2 and VQA-VS). (An optimizer-setup sketch follows the table.)
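
Since the paper exposes its method only through equations and natural language (see the Pseudocode row), here is a hypothetical sketch of what code for jointly modeling intra- and inter-modality dependencies could look like. It assumes, purely for illustration, that per-modality (intra-modality) classifier heads and a joint (inter-modality) head are combined by summing logits; the class and variable names are invented, and this is not the authors' reference implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class JointDependencyClassifier(nn.Module):
    """Hypothetical sketch: per-modality heads capture intra-modality
    dependencies, a joint head captures inter-modality dependencies,
    and their logits are summed (a product-of-experts-style combination).
    Illustration only; not the authors' implementation."""

    def __init__(self, enc_a, enc_b, dim_a, dim_b, num_classes):
        super().__init__()
        self.enc_a, self.enc_b = enc_a, enc_b        # per-modality encoders
        self.head_a = nn.Linear(dim_a, num_classes)  # intra-modality head, modality A
        self.head_b = nn.Linear(dim_b, num_classes)  # intra-modality head, modality B
        self.joint = nn.Linear(dim_a + dim_b, num_classes)  # inter-modality head

    def forward(self, x_a, x_b):
        z_a, z_b = self.enc_a(x_a), self.enc_b(x_b)
        logits_joint = self.joint(torch.cat([z_a, z_b], dim=-1))
        return self.head_a(z_a) + self.head_b(z_b) + logits_joint

# Toy usage with made-up encoders and input shapes:
enc_img = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU())
enc_aud = nn.Sequential(nn.Flatten(), nn.Linear(112 * 112, 64), nn.ReLU())
model = JointDependencyClassifier(enc_img, enc_aud, 64, 64, num_classes=10)
logits = model(torch.randn(8, 1, 28, 28), torch.randn(8, 1, 112, 112))
```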
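
The 80/10/10 MIMIC-III split quoted in the Dataset Splits row translates naturally into a utility like the one below. The fixed seed and the use of `torch.utils.data.random_split` are assumptions; the report states only the proportions and the resulting example counts.

```python
import torch
from torch.utils.data import random_split

def split_80_10_10(dataset, seed=0):
    """Split a dataset 80% train / 10% validation / 10% test.
    The seed is an assumed detail, added so the split is reproducible."""
    n = len(dataset)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    n_test = n - n_train - n_val  # remainder absorbs rounding
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```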
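
The hyperparameters quoted in the Experiment Setup row map directly onto PyTorch optimizer calls. A minimal sketch follows; the placeholder models are stand-ins (the actual architectures are LeNet, Preact ResNet-18, and FIBER), and the optimizer for the VQA-VS MLP head is not named in the report, so Adam there is an assumption.

```python
import torch
import torch.nn as nn

# Placeholder models; the paper uses LeNet, Preact ResNet-18, and FIBER.
model_avmnist = nn.Linear(784, 10)
model_mimic = nn.Linear(128, 2)
mlp_head = nn.Linear(768, 1000)

criterion = nn.CrossEntropyLoss()

# AV-MNIST: cross-entropy + SGD, lr 5e-2, weight decay 1e-4, 25 epochs.
opt_avmnist = torch.optim.SGD(model_avmnist.parameters(), lr=5e-2, weight_decay=1e-4)

# MIMIC-III: RMSProp, lr 1e-3 (batch size 40, twenty epochs).
opt_mimic = torch.optim.RMSprop(model_mimic.parameters(), lr=1e-3)

# VQA-VS: MLP classifier fine-tuned on top of the encoder, lr 1e-4.
# The optimizer for this stage is not stated in the report; Adam is assumed.
opt_vqa = torch.optim.Adam(mlp_head.parameters(), lr=1e-4)
```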