Multi-Head Modularization to Leverage Generalization Capability in Multi-Modal Networks
Authors: Jun-Tae Lee, Hyunsin Park, Sungrack Yun, Simyung Chang
AAAI 2022, pp. 7354-7362 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the effectiveness of MHM on various multi-modal tasks. We use the state-of-the-art methods as baselines, and show notable performance gains for all the baselines. We conduct extensive experiments to analyze the efficacy of MHM in terms of generalization capability. For three multi-modal tasks (audio-visual event detection, action localization, sentiment analysis), we successfully boost the performance of the state-of-the-art methods on benchmark datasets (AVE, THUMOS14, CMU-MOSEI). |
| Researcher Affiliation | Industry | Jun-Tae Lee¹, Hyunsin Park¹, Sungrack Yun¹, and Simyung Chang²; ¹Qualcomm AI Research*, ²Qualcomm Korea YH; {juntlee,hyunsinp,sungrack,simychan}@qti.qualcomm.com |
| Pseudocode | No | The paper describes the MHM algorithm in detail within the text (Section 4) but does not provide it in a structured pseudocode block or a formally labeled algorithm. |
| Open Source Code | No | The paper does not provide any statement regarding the release of source code or a link to a code repository. |
| Open Datasets | Yes | We perform experiments on AVE dataset (Tian et al. 2018). We use THUMOS14 (Jiang et al. 2014) dataset. We evaluate our method for the multi-modal sentiment analysis on CMU-MOSEI (Zadeh et al. 2018) dataset. |
| Dataset Splits | Yes | For each class, we randomly select 90% of data points for training and use the remaining for testing. (toy dataset); AVE dataset... It consists of 3,339 training and 804 testing videos; THUMOS14 (Jiang et al. 2014) dataset containing 200 training and 212 testing videos. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers with their versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | For audio-visual event detection, we use four head modules. For RGB-flow action localization, the number of head modules is set to 2. For multi-modal sentiment analysis, K is empirically set as 3. |
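The Dataset Splits row above notes that the toy-dataset experiment uses a random per-class 90%/10% train/test split. The paper gives no splitting code, so the following is a minimal sketch of one way such a split could be produced; the function name `per_class_split`, the fixed seed, and the toy labels are illustrative assumptions, not the authors' procedure.

```python
import random
from collections import defaultdict

def per_class_split(labels, train_ratio=0.9, seed=0):
    """Randomly split sample indices into train/test sets within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(len(indices) * train_ratio)  # 90% of each class for training
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# Example: 20 toy samples over 2 classes, split 90%/10% within each class.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = per_class_split(labels, train_ratio=0.9)
```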
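The Experiment Setup row reports only the number of head modules K per task (4 for audio-visual event detection, 2 for RGB-flow action localization, 3 for sentiment analysis). Since no source code is released, the sketch below shows, under stated assumptions, what a K-head module over a shared multi-modal embedding could look like in PyTorch; the module structure, feature dimensions, class count, and the simple averaging of head predictions at inference are all assumptions, not the authors' implementation of MHM.

```python
import torch
import torch.nn as nn

class MultiHeadModularized(nn.Module):
    """Hypothetical sketch: a shared multi-modal body followed by K head modules.

    The body fuses features from two modalities (e.g., audio/visual or RGB/flow);
    each head then produces its own prediction. All architectural details here
    are assumed, as the paper does not release code.
    """

    def __init__(self, feat_dim=256, num_classes=28, num_heads=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),  # assumed: concatenation of two modalities
            nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(num_heads)
        )

    def forward(self, feat_a, feat_b):
        fused = self.body(torch.cat([feat_a, feat_b], dim=-1))
        logits_per_head = [head(fused) for head in self.heads]
        # Assumed inference rule: average the predictions of all K heads.
        return torch.stack(logits_per_head).mean(dim=0), logits_per_head


# Example: K = 4 heads, as reported for audio-visual event detection in the table.
model = MultiHeadModularized(feat_dim=256, num_classes=28, num_heads=4)
audio, visual = torch.randn(8, 256), torch.randn(8, 256)
avg_logits, per_head_logits = model(audio, visual)
```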