SECap: Speech Emotion Captioning with Large Language Model

Authors: Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shi-Xiong Zhang, Guangzhi Li, Yi Luo, Rongzhi Gu

Venue: AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results of objective and subjective evaluations demonstrate that: 1) the SECap framework outperforms the HTSAT-BART baseline in all objective evaluations; 2) SECap can generate high-quality speech emotion captions that attain performance on par with human annotators in subjective mean opinion score tests.
Researcher Affiliation | Collaboration | (1) Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; (2) Tencent AI Lab; (3) The Chinese University of Hong Kong, Hong Kong SAR, China
Pseudocode | No | The paper describes the model architecture and training processes with mathematical equations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Codes, models and results: https://github.com/thuhcsi/SECap
Open Datasets | Yes | Due to the lack of publicly available SEC datasets, we utilize an internal dataset called EMOSpeech. ... Please refer to the project's GitHub repository for the detailed dataset construction process, where the test set is also publicly available.
Dataset Splits | Yes | Upon constructing the EMOSpeech dataset, we randomly select 600 sentences for testing, 600 sentences for validation, and the remaining 29,326 sentences for training. (A minimal split sketch appears after this table.)
Hardware Specification | No | The paper does not specify any particular hardware components (e.g., specific GPU/CPU models, memory amounts) used for running experiments.
Software Dependencies | No | The paper mentions software components and cites associated research papers (e.g., 'LLaMA (Cui, Yang, and Yao 2023)', 'HuBERT (Hsu et al. 2021)', 'BERT-base (Devlin et al. 2019)'), but it does not provide explicit version numbers for the underlying software libraries or frameworks (e.g., PyTorch version, Python version).
Experiment Setup | No | The paper describes the general training process (two-stage training with frozen parameters and pre-trained model initialization) but does not explicitly state specific hyperparameter values such as learning rates, batch sizes, or total training epochs in the main text, noting that 'Specific experimental details are given at GitHub repository.' (An illustrative pipeline sketch follows the table.)
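
The 600/600/29,326 split reported above implies 30,526 utterances in total and can be reproduced with a seeded random selection. Below is a minimal sketch, assuming utterance IDs are available as a sequence; split_emospeech is a hypothetical helper, not code from the authors' repository.

    import random

    def split_emospeech(utterance_ids, n_test=600, n_val=600, seed=0):
        """Seeded random split into train/validation/test partitions.

        Hypothetical helper illustrating the 600/600/29,326 split described
        in the paper; the actual procedure lives in the authors' GitHub repo.
        """
        ids = list(utterance_ids)
        random.Random(seed).shuffle(ids)      # deterministic shuffle
        test = ids[:n_test]
        val = ids[n_test:n_test + n_val]
        train = ids[n_test + n_val:]          # remaining sentences go to training
        return train, val, test

    # Example with dummy IDs: 30,526 utterances -> 29,326 / 600 / 600.
    train, val, test = split_emospeech(range(30526))
    assert (len(train), len(val), len(test)) == (29326, 600, 600)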
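
For orientation, the components named in the Software Dependencies and Experiment Setup rows (a frozen HuBERT speech encoder, a BERT-initialized bridge, and a LLaMA decoder, trained in two stages with frozen parameters) suggest a prefix-conditioning pipeline. The PyTorch sketch below illustrates that general pattern only; it is not the authors' implementation, and all module names, dimensions, and layer counts are assumptions.

    import torch
    import torch.nn as nn

    class AudioBridge(nn.Module):
        """Learnable-query bridge mapping speech-encoder features to a
        fixed-length soft prefix for a frozen language model (hypothetical
        sketch; dimensions and depth are illustrative)."""

        def __init__(self, audio_dim=1024, llm_dim=4096, n_query=32):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, llm_dim)
            layer = nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8,
                                               batch_first=True)
            self.bridge = nn.TransformerDecoder(layer, num_layers=2)
            self.query = nn.Parameter(torch.randn(n_query, llm_dim))

        def forward(self, audio_feats):
            # audio_feats: (B, T, audio_dim) from a frozen speech encoder
            # such as HuBERT-large (feature dim 1024).
            memory = self.audio_proj(audio_feats)
            queries = self.query.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
            # Cross-attend the learnable queries to the audio features; the
            # (B, n_query, llm_dim) output would be prepended to the LLM input.
            return self.bridge(queries, memory)

    bridge = AudioBridge()
    feats = torch.randn(2, 250, 1024)   # ~5 s of 50 Hz encoder frames
    prefix = bridge(feats)              # (2, 32, 4096)

In such a setup only the bridge (and any prompt embeddings) would receive gradients, which is consistent with the frozen-parameter, two-stage description quoted above.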