Multi-Head Modularization to Leverage Generalization Capability in Multi-Modal Networks

Authors: Jun-Tae Lee, Hyunsin Park, Sungrack Yun, Simyung Chang (pp. 7354-7362)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify the effectiveness of MHM on various multi-modal tasks. We use the state-of-the-art methods as baselines, and show notable performance gains for all the baselines. We conduct extensive experiments to analyze the efficacy of MHM in terms of generalization capability. For three multi-modal tasks (audio-visual event detection, action localization, sentiment analysis), we successfully boost the performance of the state-of-the-art methods on benchmark datasets (AVE, THUMOS14, CMU-MOSEI).
Researcher Affiliation | Industry | Jun-Tae Lee (1), Hyunsin Park (1), Sungrack Yun (1), and Simyung Chang (2); (1) Qualcomm AI Research, (2) Qualcomm Korea YH; {juntlee,hyunsinp,sungrack,simychan}@qti.qualcomm.com
Pseudocode | No | The paper describes the MHM algorithm in detail in the text (Section 4) but does not provide it as a structured pseudocode block or a formally labeled algorithm.
Open Source Code | No | The paper does not provide any statement regarding the release of source code or a link to a code repository.
Open Datasets | Yes | We perform experiments on the AVE dataset (Tian et al. 2018). We use the THUMOS14 (Jiang et al. 2014) dataset. We evaluate our method for multi-modal sentiment analysis on the CMU-MOSEI (Zadeh et al. 2018) dataset.
Dataset Splits | Yes | For each class, we randomly select 90% of data points for training and use the remaining for testing (toy dataset). The AVE dataset consists of 3,339 training and 804 testing videos. THUMOS14 (Jiang et al. 2014) contains 200 training and 212 testing videos. (See the split sketch after this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory, cloud instances) used to run the experiments.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers) that would be needed to replicate the experiments.
Experiment Setup | Yes | For audio-visual event detection, we use four head modules. For RGB-flow action localization, the number of head modules is set to 2. For multi-modal sentiment analysis, K is empirically set as 3. (See the configuration sketch after this table.)
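
The Dataset Splits row reports a per-class random 90/10 split for the toy dataset. The snippet below is a minimal sketch of such a stratified split, not the authors' code; the function name, variable names, and fixed seed are our own assumptions added for reproducibility of the illustration.

```python
import random
from collections import defaultdict

def per_class_split(samples, labels, train_ratio=0.9, seed=0):
    """Randomly split samples into train/test subsets per class.

    `samples` and `labels` are parallel lists; the 90/10 ratio follows the
    protocol quoted above, while the seed is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(train_ratio * len(indices)))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])

    train = [samples[i] for i in train_idx]
    test = [samples[i] for i in test_idx]
    return train, test

# Usage: train_set, test_set = per_class_split(points, point_labels)
```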
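
The Experiment Setup row only states the number of head modules K used per task (4, 2, and 3). The sketch below shows one plausible way to attach K identical head modules to a shared multi-modal feature; the class name, head architecture, hidden sizes, and mean aggregation are assumptions for illustration, not the paper's published MHM design.

```python
import torch
import torch.nn as nn

# Number of head modules K per task, as reported in the Experiment Setup row.
NUM_HEADS = {
    "audio_visual_event_detection": 4,
    "rgb_flow_action_localization": 2,
    "multimodal_sentiment_analysis": 3,
}

class MultiHeadModularNet(nn.Module):
    """Illustrative multi-head wrapper: a shared fused feature feeds K heads.

    The head depth and the mean aggregation of head outputs are placeholders
    chosen for this sketch; the paper does not specify them in this summary.
    """

    def __init__(self, feature_dim: int, num_classes: int, num_heads: int):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feature_dim, feature_dim),
                nn.ReLU(),
                nn.Linear(feature_dim, num_classes),
            )
            for _ in range(num_heads)
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # Each head produces its own logits; averaging is one simple way to
        # combine the modularized heads at inference time.
        logits = torch.stack([head(fused_features) for head in self.heads])
        return logits.mean(dim=0)

# Example: the audio-visual event detection variant uses K = 4 heads.
# feature_dim and num_classes are placeholder values for this sketch.
model = MultiHeadModularNet(
    feature_dim=256,
    num_classes=28,
    num_heads=NUM_HEADS["audio_visual_event_detection"],
)
```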