Provable Dynamic Fusion for Low-Quality Multimodal Data
Authors: Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, Xi Peng
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on multiple benchmarks can support our findings. In this section, we conduct experiments on multiple datasets of diverse applications. |
| Researcher Affiliation | Academia | 1College of Intelligence and Computing, Tianjin University, Tianjin, China 2Tianjin Key Lab of Machine Learning, Tianjin University, Tianjin, China 3Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 4Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore 5College of Computer Science, Sichuan University, Chengdu, China. |
| Pseudocode | Yes | Algorithm 1: Training Pseudo Code of Quality-aware Multimodal Fusion (QMF). A hedged sketch of such a training step appears after this table. |
| Open Source Code | Yes | Code is available at https://github.com/QingyangZhang/QMF. |
| Open Datasets | Yes | We evaluate our method on two multimodal classification tasks. Scene recognition: NYU Depth V2 (Silberman et al., 2012) and SUN RGB-D (Song et al., 2015) are two public indoor scene recognition datasets, each associated with two modalities, i.e., RGB and depth images. Image-text classification: the UPMC FOOD101 dataset (Wang et al., 2015) contains (possibly noisy) images obtained by Google Image Search together with corresponding textual descriptions. The MVSA sentiment analysis dataset (Niu et al., 2016) includes a set of image-text pairs with manual annotations collected from social media. |
| Dataset Splits | Yes | For FOOD101, following previous work (Kiela et al., 2019), there are 60,101 image-text pairs in the training set, 5,000 in the validation set, and 21,695 in the test set. For MVSA, the validation set contains 518 image-text pairs and the test set contains 519. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, ResNet, BERT, and MindSpore, but does not provide version numbers for these dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | The learning rate is 1e-4, the dropout rate is 0.1, and the warmup rate is 0.1. The hyperparameter λ is set to 0.1, and the temperature parameters $\{T_m\}_{m=1}^{M}$ are set to 1. An early-stop strategy based on validation accuracy is adopted. A hedged configuration sketch appears after this table. |
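
The Pseudocode row cites Algorithm 1, the training pseudocode of Quality-aware Multimodal Fusion (QMF). As a rough illustration only, here is a minimal PyTorch sketch of a quality-aware fusion step: per-modality classifiers are combined with per-sample confidence weights derived from an energy score, and λ weights an auxiliary per-modality term. The energy-based weighting, the role of λ, and all function names here are assumptions of this sketch, not a transcription of the paper's Algorithm 1.

```python
import torch
import torch.nn.functional as F

def energy_confidence(logits, T=1.0):
    # Negative free energy as a confidence proxy: T * logsumexp(f(x)/T).
    # Using this as the "quality" signal is an assumption of the sketch.
    return T * torch.logsumexp(logits / T, dim=-1)

def qmf_training_step(logits_per_modality, labels, lam=0.1, T=1.0):
    """One hedged QMF-style step. `logits_per_modality` is a list of
    [batch, num_classes] tensors, one per modality."""
    # Auxiliary unimodal objective (its exact form in the paper may differ).
    unimodal_loss = sum(F.cross_entropy(z, labels) for z in logits_per_modality)
    # Dynamic fusion: per-sample weights from each modality's confidence.
    conf = torch.stack([energy_confidence(z, T) for z in logits_per_modality], dim=-1)
    weights = torch.softmax(conf, dim=-1)  # [batch, num_modalities]
    fused = sum(w.unsqueeze(-1) * z
                for w, z in zip(weights.unbind(-1), logits_per_modality))
    fusion_loss = F.cross_entropy(fused, labels)
    # lam = 0.1 matches the quoted setup; what it multiplies is an assumption.
    return fusion_loss + lam * unimodal_loss
```

The quoted temperature parameters $\{T_m\}_{m=1}^{M} = 1$ correspond to `T=1.0` here; the sketch uses a single shared temperature for brevity.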
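
For the Experiment Setup row, the sketch below wires the quoted hyperparameters (learning rate 1e-4, warmup rate 0.1, early stopping on validation accuracy) into a plausible Adam-plus-warmup configuration. The schedule shape and the early-stopping patience are assumptions; the paper states the values but not these details.

```python
import torch

LR, WARMUP_RATE, DROPOUT, LAMBDA = 1e-4, 0.1, 0.1, 0.1  # values quoted above

def make_optimizer(model, total_steps):
    # Adam at the quoted learning rate.
    opt = torch.optim.Adam(model.parameters(), lr=LR)
    warmup_steps = max(1, int(WARMUP_RATE * total_steps))
    # Linear warmup, then constant LR; the warmup shape is an assumption.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: min(1.0, (step + 1) / warmup_steps))
    return opt, sched

def should_early_stop(val_accs, patience=5):
    # Stop when validation accuracy has not improved for `patience` epochs
    # (the patience value itself is an assumption).
    best_epoch = val_accs.index(max(val_accs))
    return len(val_accs) - 1 - best_epoch >= patience
```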