Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
Authors: Divyam Madaan, Taro Makino, Sumit Chopra, Kyunghyun Cho
NeurIPS 2024 | Conference PDF | Archive PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach using real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional approaches that focus on only one type of modality dependency. |
| Researcher Affiliation | Collaboration | Divyam Madaan (1), Taro Makino (2), Sumit Chopra (1, 2, 3), Kyunghyun Cho (1, 2, 4, 5). Affiliations: (1) Courant Institute of Mathematical Sciences, New York University; (2) Center for Data Science, New York University; (3) Grossman School of Medicine, New York University; (4) Prescient Design, Genentech; (5) CIFAR LMB |
| Pseudocode | No | The paper describes its models and methods using equations and natural language, but no structured pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | The code is available at https://github.com/divyam3897/I2M2. |
| Open Datasets | Yes | AV-MNIST combines audio and visual modalities for an MNIST digit (0-9) recognition task. We use 55,000 examples for training, 5,000 for validation, and 10,000 for testing. The fastMRI dataset [81] was the first large-scale dataset that consisted of raw k-space data... The Medical Information Mart for Intensive Care (MIMIC-III) dataset is a popular medical benchmark... The NLVR2 dataset [70] incorporated real-world photographs. VQA-VS [65] consolidated the training and validation sets from the VQA v2 dataset. |
| Dataset Splits | Yes | We use 55,000 examples for training, 5,000 for validation, and 10,000 for testing. (AV-MNIST) Following Liang et al. [41], we split the dataset into 80% for training, 10% for validation, and 10% for testing. This results in 28,970 training, 3,621 validation, and 3,621 testing examples. (MIMIC-III; see the split sketch after the table) |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions various models and optimizers used (e.g., LeNet, Preact ResNet-18, FIBER model with Swin Transformer and RoBERTa, SGD, RMSProp) but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For training all the models, we optimize the cross-entropy loss with SGD, using a learning rate of 5e-2 and weight decay of 1e-4 for 25 epochs. (AV-MNIST) We use a batch size of 40 and RMSProp with a learning rate of 1e-3 to train for 20 epochs across all the tasks. (MIMIC-III) We fine-tune an MLP classifier on top of the encoder with a learning rate of 1e-4 for VQA-VS. For NLVR2, we fine-tune the full model with a learning rate of 1e-5 for five seeds. (NLVR2 and VQA-VS; see the training-loop sketch after the table) |
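The 80/10/10 MIMIC-III split quoted above is straightforward to reproduce. Below is a minimal sketch, assuming NumPy; the function name `split_indices` and the fixed seed are illustrative assumptions, not taken from the paper or its repository.

```python
import numpy as np

def split_indices(n_examples, train_frac=0.8, val_frac=0.1, seed=0):
    # Shuffle all indices once, then cut the permutation into
    # contiguous train / validation / test partitions.
    rng = np.random.default_rng(seed)  # seed is an illustrative choice
    idx = rng.permutation(n_examples)
    n_train = round(train_frac * n_examples)
    n_val = round(val_frac * n_examples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# 36,212 total examples yield the counts reported for MIMIC-III.
train_idx, val_idx, test_idx = split_indices(36_212)
print(len(train_idx), len(val_idx), len(test_idx))  # 28970 3621 3621
```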
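The AV-MNIST recipe quoted above maps onto a short training loop. The sketch below assumes PyTorch (the paper names the optimizers but not the framework; see the Software Dependencies row) and uses a placeholder linear model over random tensors in place of the paper's LeNet-style encoders and the actual AV-MNIST data.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; the paper's encoders are LeNet /
# PreAct ResNet-18 style networks over audio and image inputs.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
data = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

# Quoted AV-MNIST hyperparameters: SGD, lr 5e-2, weight decay 1e-4, 25 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=5e-2, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(25):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```

For the other datasets, the quoted recipe swaps the optimizer and schedule: `torch.optim.RMSprop(model.parameters(), lr=1e-3)` with a batch size of 40 for 20 epochs on MIMIC-III, a learning rate of 1e-4 for the MLP head on VQA-VS, and 1e-5 for full-model fine-tuning on NLVR2.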