Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition
Authors: Qianrui Zhou, Hua Xu, Hao Li, Hanlei Zhang, Xiaohan Zhang, Yifan Wang, Kai Gao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. |
| Researcher Affiliation | Academia | Qianrui Zhou1,2, Hua Xu1,2*, Hao Li1,2, Hanlei Zhang1,2, Xiaohan Zhang1,3, Yifan Wang1,3, Kai Gao3 1Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 2Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China 3School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China |
| Pseudocode | No | The paper describes the method using text and mathematical formulations, but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | The codes are released at https://github.com/thuiar/TCL-MAP. |
| Open Datasets | Yes | We conduct experiments on two challenging multimodal datasets to evaluate our proposed framework. MIntRec (Zhang et al. 2022) is a fine-grained dataset for multimodal intent recognition... MELD-DA (Saha et al. 2020) is a large-scale dataset for dialogue act classification... |
| Dataset Splits | Yes | For MIntRec, we follow the dataset splits consisting of 1,334 samples for training, 445 samples for validation, and 445 samples for testing. For MELD-DA, the dataset is divided into a training set of 6,991 samples, a validation set of 999 samples, and a test set of 1,998 samples. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, processor types, or memory amounts) used for running experiments are mentioned. |
| Software Dependencies | No | We utilize bert-base-uncased and wav2vec2-base-960h from the Huggingface Library (Wolf et al. 2019) to extract text and audio features, and swin_b pre-trained on ImageNet-1K (Deng et al. 2009) from the Torchvision Library (maintainers and contributors 2016) to extract video features. (Specific library versions are not provided for the Huggingface or Torchvision libraries; a hedged loading sketch follows the table.) |
| Experiment Setup | Yes | The training batch size is set to 16, while the validation and test batch sizes are both set to 8. For the total loss L, we employ AdamW (Loshchilov and Hutter 2017) to optimize the parameters. (A minimal training-setup sketch follows the table.) |
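
The paper names the pretrained backbones but does not give loading code. Below is a minimal sketch of how these feature extractors could be instantiated, assuming recent `transformers` and `torchvision` releases; the `facebook/` Hub namespace for the wav2vec2 checkpoint is an assumption, as the paper only gives the model name.

```python
from transformers import AutoModel, AutoTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Model
from torchvision.models import Swin_B_Weights, swin_b

# Text encoder: bert-base-uncased from the HuggingFace Hub.
text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

# Audio encoder: wav2vec2-base-960h; the "facebook/" namespace is an assumption.
audio_processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Video encoder: Swin-B pretrained on ImageNet-1K, loaded from Torchvision.
video_encoder = swin_b(weights=Swin_B_Weights.IMAGENET1K_V1)
```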
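
Similarly, a minimal sketch of the reported training configuration (batch sizes 16/8/8, AdamW); the dataset, model, and learning rate below are placeholders, not values taken from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model stand-ins so the sketch is runnable; shapes are illustrative only.
train_set = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
val_set = TensorDataset(torch.randn(16, 10), torch.randint(0, 2, (16,)))
test_set = TensorDataset(torch.randn(16, 10), torch.randint(0, 2, (16,)))
model = nn.Linear(10, 2)

# Batch sizes as reported: 16 for training, 8 for validation and test.
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)
test_loader = DataLoader(test_set, batch_size=8)

# AdamW (Loshchilov and Hutter 2017) over all trainable parameters;
# the learning rate is a placeholder, not a value reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```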