Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Authors: Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Jingdong Sun, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander G. Hauptmann

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on the MER2023-SEMI challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on the DFEW dataset.
Researcher Affiliation | Collaboration | Zebang Cheng (1,2), Zhi-Qi Cheng (3), Jun-Yan He (4), Jingdong Sun (3), Kai Wang (5), Yuxiang Lin (1), Zheng Lian (6), Xiaojiang Peng (1,2), Alexander G. Hauptmann (3); affiliations: 1 Shenzhen Technology University, 2 Shenzhen University, 3 Carnegie Mellon University, 4 Alibaba Group, 5 National University of Singapore, 6 Institute of Automation, Chinese Academy of Sciences
Pseudocode | Yes | Algorithm 1: Multimodal Emotion Annotation Procedure
Open Source Code | Yes | Project: https://zebangcheng.github.io/Emotion-LLaMA; Demo: https://huggingface.co/spaces/ZebangCheng/Emotion-LLaMA
Open Datasets | Yes | To address these challenges, we introduce the MERR dataset (Sec. 3.1), which enables multimodal large models and supports instruction tuning to learn from diverse scenarios and generalize to real-world applications.
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | Yes | We train on 4×A100 GPUs for 300,000 steps, which takes around 20 hours.
Software Dependencies | No | The paper mentions several software components and models (e.g., the OpenFace toolkit, MiniGPT-v2, Qwen-Audio, LLaMA-3), but it does not provide specific version numbers for the experimental software dependencies.
Experiment Setup | Yes | For the global visual encoder, we employ the EVA model with full images sized at 448×448 pixels as input. For the local and temporal visual encoders, we first crop and align the faces within the images, then hierarchically sample 16 facial images as inputs for the MAE and VideoMAE models. The audio is handled by the HuBERT-Chinese large model. The extracted emotional features are transformed into a 4096-dimensional space via linear layers before being concatenated with the text tokens. During the tuning process, we froze the visual and audio backbones, focusing on training the linear projection layers. For the language model (LLM), we utilize LLaMA2-chat (7B) equipped with LoRA for parameter-efficient tuning. Following the MiniGPT-v2 approach, we fine-tune the query and value projection matrices (Wq and Wv), setting r = 64 and α = 16. Consequently, the trainable parameters of Emotion-LLaMA total only 34 million, representing a mere 0.495% of the overall parameter count. We train on 4×A100 GPUs for 300,000 steps, which takes around 20 hours.
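
Below is a minimal sketch of the parameter-efficient setup quoted in the Experiment Setup row, written against the Hugging Face peft library: frozen encoder features are mapped into the 4096-dimensional LLaMA-2 token space by linear projection layers, and LoRA (r = 64, α = 16) is applied to the query and value projections. The per-encoder feature dimensions, the target module names "q_proj"/"v_proj", and the dropout value are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the Emotion-LLaMA tuning setup described above:
# frozen visual/audio encoders, linear projections into the 4096-d
# LLaMA2-chat (7B) embedding space, and LoRA (r=64, alpha=16) on the
# attention query/value projections. Dimensions other than 4096 and the
# module names below are assumptions for illustration.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

LLM_DIM = 4096  # hidden size of LLaMA2-chat (7B)

class EmotionProjector(nn.Module):
    """Maps a frozen encoder's features into the LLM token-embedding space."""
    def __init__(self, feat_dim: int, llm_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, feat_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(feats)

# Assumed encoder output dimensions (illustrative only).
projectors = nn.ModuleDict({
    "global_eva":    EmotionProjector(1408),  # EVA global frame features
    "local_mae":     EmotionProjector(768),   # MAE facial features
    "temporal_vmae": EmotionProjector(768),   # VideoMAE temporal features
    "audio_hubert":  EmotionProjector(1024),  # HuBERT-Chinese large features
})

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# LoRA on the query/value projection matrices, as in MiniGPT-v2.
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # assumed module names
    lora_dropout=0.05,                    # assumed; not reported in the excerpt
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the LoRA adapters are trainable here

# Training step (conceptual): run each frozen encoder, project its features,
# concatenate them with the text token embeddings, and feed the sequence to
# the LoRA-tuned LLM with a standard causal language-modeling loss.
```

Freezing the backbones and updating only the projection layers plus the LoRA adapters is what keeps the trainable parameter count near the 34 million (about 0.5% of all parameters) quoted in the setup.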