What Makes Multi-Modal Learning Better than Single (Provably)
Authors: Yu Huang, Chenzhuang Du, Zihui Xue, Xuanyao Chen, Hang Zhao, Longbo Huang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiment): We conduct experiments to validate our theoretical results. The data we consider are two-fold: a multi-modal real-world dataset and a well-designed generated dataset. |
| Researcher Affiliation | Academia | Yu Huang1,*, Chenzhuang Du1,*, Zihui Xue2, Xuanyao Chen3,4, Hang Zhao1, Longbo Huang1 (1 Institute for Interdisciplinary Information Sciences, Tsinghua University; 2 The University of Texas at Austin; 3 Fudan University; 4 Shanghai Qi Zhi Institute) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide access to source code for the described methodology. |
| Open Datasets | Yes | The natural dataset we use is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which is an acted multi-modal and multi-speaker database [6]. |
| Dataset Splits | No | The paper mentions 13,200 samples for training and 3,410 for testing, but does not specify a validation set or its size, nor does it describe the splitting methodology in enough detail for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions training settings like optimizer and batch size. |
| Software Dependencies | No | The paper mentions using "Adam [25] as the optimizer" but does not provide specific version numbers for any software libraries, programming languages, or other dependencies. |
| Experiment Setup | Yes | For all experiments on IEMOCAP, we use a single linear neural network layer to extract latent features, with a hidden dimension of 128. In the multi-modal network, modalities do not share encoders; the features are concatenated first and then mapped to the task space. We use Adam [25] as the optimizer with a learning rate of 0.01 and other hyper-parameters at their defaults. The batch size is 2048. For this classification task, top-1 accuracy is the performance measure. |
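
To make the reported setup concrete, below is a minimal sketch of the described architecture and hyper-parameters. The paper does not name its framework, so PyTorch is assumed here; the class name `MultiModalNet`, the per-modality input dimensions (300 and 100), and the number of classes (4) are illustrative placeholders, not values taken from the paper. Only the hidden dimension (128), per-modality linear encoders without weight sharing, feature concatenation, Adam with learning rate 0.01, batch size 2048, and top-1 accuracy come from the table above.

```python
# Minimal sketch of the reported multi-modal setup, assuming PyTorch.
# Input dimensions and the number of classes are placeholders.
import torch
import torch.nn as nn


class MultiModalNet(nn.Module):
    """One linear encoder per modality (no weight sharing), feature
    concatenation, then a linear map to the task space."""

    def __init__(self, modality_dims, num_classes, hidden_dim=128):
        super().__init__()
        # Each modality gets its own encoder, as stated in the setup.
        self.encoders = nn.ModuleList(
            nn.Linear(d, hidden_dim) for d in modality_dims
        )
        self.head = nn.Linear(hidden_dim * len(modality_dims), num_classes)

    def forward(self, inputs):
        # inputs: one tensor per modality, each of shape (batch, dim_m)
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]
        return self.head(torch.cat(feats, dim=-1))


# Hyper-parameters quoted from the table: hidden dimension 128,
# Adam with lr = 0.01 and other settings at their defaults, batch size 2048.
model = MultiModalNet(modality_dims=[300, 100], num_classes=4)  # dims/classes assumed
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
batch_size = 2048


def top1_accuracy(logits, labels):
    # Top-1 accuracy, the performance measure used for the classification task.
    return (logits.argmax(dim=-1) == labels).float().mean().item()
```

The encoders are kept in an `nn.ModuleList` rather than shared, matching the statement that different modalities do not share encoders; a single-modality baseline would be the same sketch with one entry in `modality_dims`.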