Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition
Authors: Qianrui Zhou, Hua Xu, Hao Li, Hanlei Zhang, Xiaohan Zhang, Yifan Wang, Kai Gao
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. |
| Researcher Affiliation | Academia | Qianrui Zhou1,2, Hua Xu1,2*, Hao Li1,2, Hanlei Zhang1,2, Xiaohan Zhang1,3, Yifan Wang1,3, Kai Gao3 1Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China 2Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China 3School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China |
| Pseudocode | No | The paper describes the method using text and mathematical formulations, but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | The codes are released at https://github.com/thuiar/TCL-MAP. |
| Open Datasets | Yes | We conduct experiments on two challenging multimodal datasets to evaluate our proposed framework. MIntRec (Zhang et al. 2022) is a fine-grained dataset for multimodal intent recognition... MELD-DA (Saha et al. 2020) is a large-scale dataset for dialogue act classification... |
| Dataset Splits | Yes | For MIntRec, we follow the dataset splits consisting of 1,334 samples for training, 445 samples for validation, and 445 samples for testing. For MELD-DA, the dataset is divided into a training set of 6,991 samples, a validation set of 999 samples, and a test set of 1,998 samples. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, processor types, or memory amounts) used for running experiments are mentioned. |
| Software Dependencies | No | We utilize bert-base-uncased and wav2vec2-base-960h from the Huggingface Library (Wolf et al. 2019) to extract text and audio features, and swin_b pre-trained on ImageNet-1K (Deng et al. 2009) from the Torchvision Library (maintainers and contributors 2016) to extract video features. (Specific library versions are not provided for the Huggingface or Torchvision libraries; a hedged loading sketch follows the table.) |
| Experiment Setup | Yes | The training batch size is set to 16, while the validation and test batch sizes are both set to 8. For the total loss L, we employ AdamW (Loshchilov and Hutter 2017) to optimize the parameters. (A minimal training-setup sketch follows the table.) |
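
The paper names the pretrained backbones but does not give loading code. Below is a minimal sketch of how these feature extractors could be instantiated, assuming recent `transformers` and `torchvision` releases; the `facebook/` Hub namespace for the wav2vec2 checkpoint is an assumption, as the paper only gives the model name.

```python
from transformers import AutoModel, AutoTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Model
from torchvision.models import Swin_B_Weights, swin_b

# Text encoder: bert-base-uncased from the HuggingFace Hub.
text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

# Audio encoder: wav2vec2-base-960h; the "facebook/" namespace is an assumption.
audio_processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# Video encoder: Swin-B pretrained on ImageNet-1K, loaded from Torchvision.
video_encoder = swin_b(weights=Swin_B_Weights.IMAGENET1K_V1)
```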
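
Similarly, a minimal sketch of the reported training configuration (batch sizes 16/8/8, AdamW); the dataset, model, and learning rate below are placeholders, not values taken from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model stand-ins so the sketch is runnable; shapes are illustrative only.
train_set = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
val_set = TensorDataset(torch.randn(16, 10), torch.randint(0, 2, (16,)))
test_set = TensorDataset(torch.randn(16, 10), torch.randint(0, 2, (16,)))
model = nn.Linear(10, 2)

# Batch sizes as reported: 16 for training, 8 for validation and test.
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)
test_loader = DataLoader(test_set, batch_size=8)

# AdamW (Loshchilov and Hutter 2017) over all trainable parameters;
# the learning rate is a placeholder, not a value reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```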