Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding

Authors: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Qun Liu, Dongyan Zhao

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Based on this Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which have the multi-party nature with the video or image as visual context. For conversation speaker identification, we demonstrate the inefficiencies of existing methods such as pre-trained models, and propose a simple yet effective baseline method that leverages an optimization solver to utilize the context of two modalities to achieve better performance. For conversation response prediction, we fine-tune generative dialogue models on Friends-MMC, and analyze the benefits of speaker information. The code and dataset will be publicly available, and thus we call for more attention on modelling speaker information when understanding conversations. Experiment results: Results can be found in Table 2: (1) visual context acquired by the vision model M1, including which face appears in the frame and looks like a speaking face, serves as the most critical clue, shown by the performance of M1 (lines 1, 2).
Researcher Affiliation | Collaboration | Yueqian Wang1, Xiaojun Meng2, Yuxuan Wang3, Jianxin Liang1, Qun Liu2, Dongyan Zhao1,4 * 1Wangxuan Institute of Computer Technology, Peking University 2Huawei Noah's Ark Lab 3Beijing Institute for General Artificial Intelligence 4National Key Laboratory of General Artificial Intelligence EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes a baseline method with three modules and provides a model overview in Figure 3, but it does not include explicit pseudocode or algorithm blocks with structured steps.
Open Source Code | No | The code and dataset will be publicly available, and thus we call for more attention on modelling speaker information when understanding conversations.
Open Datasets | Yes | The code and dataset will be publicly available, and thus we call for more attention on modelling speaker information when understanding conversations. Dataset URL: https://github.com/yellow-binary-tree/Friends-MMC
Dataset Splits | Yes | Faces and their character names in each frame are detected and labelled automatically for the train set (Seasons 1, 2, and 4 to 10), and manually for the test set (Season 3) to ensure its accuracy. For the test set, we directly use the human-annotated faces in C1C to guarantee the accuracy of face labelling, thus serving as high-quality ground truths for this test set. Moreover, in order to align with the fact of imperfect face recognition in real-world scenarios and be consistent with the train set, we also create a more challenging test-noisy set by randomly removing 20% of labelled face tracks. Dataset statistics are shown in Table 3. We provide a train set, a test set, and a more challenging test-noisy set.
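The test-noisy construction quoted above (randomly removing 20% of labelled face tracks) can be sketched as follows. This is a minimal illustration, not code from the paper; the track representation, seed, and function name are assumptions.

```python
import random

def make_test_noisy(face_tracks, drop_ratio=0.2, seed=0):
    """Build a test-noisy variant by dropping a fixed fraction of labelled
    face tracks at random, mimicking imperfect real-world face recognition."""
    rng = random.Random(seed)
    n_drop = int(len(face_tracks) * drop_ratio)
    dropped = set(rng.sample(range(len(face_tracks)), n_drop))
    return [t for i, t in enumerate(face_tracks) if i not in dropped]

# Illustrative usage with dummy track IDs (not real dataset entries).
tracks = [f"track_{i}" for i in range(100)]
noisy = make_test_noisy(tracks)
print(len(noisy))  # 80: exactly 20% of the tracks are removed
```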
Hardware Specification | No | The paper describes various models used (e.g., Inception, DeBERTa-v3, TalkNet, Violet, LLaVA, Emu, GPT-3.5, GPT-4o) and fine-tuning processes, but does not specify any particular hardware such as the GPU or CPU models used for these experiments.
Software Dependencies | Yes | By now, this problem can be easily solved using optimization problem solvers like (Gurobi Optimization, LLC 2023), which adaptively makes decisions based on the output of M1 and M2.
Experiment Setup | Yes | If the mean value of the largest 5 cosine similarities is greater than a threshold t = 0.6 (which is set to maximize the validation accuracy described in the following paragraph), we label this face track with the corresponding character name; otherwise we consider that this face does not belong to any of the main characters and discard it. We use the cross-entropy classification loss as the training objective. The loss function is defined as: L_{M2} = MSE(p_sim, y_sim) + MSE(p_sim, p_sim^T), where α is a hyperparameter to control the weight of two rewards and is selected according to the performance on a validation set held out from the train set. We use α = 0.8 for frame as visual context, 0.7 for video as visual context, and 0.2 when ground-truth labels of the text model (M2) are provided.
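The face-track labelling rule quoted above (mean of the top-5 cosine similarities against a per-character gallery, threshold t = 0.6) can be sketched as follows. Only the threshold, the top-5 rule, and the discard behaviour come from the paper; the gallery structure and function names are illustrative assumptions.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def label_face_track(track_feat, char_gallery, t=0.6, top_k=5):
    """Label a face track with the character whose gallery faces are most
    similar; return None (discard) if the mean of the top-k cosine
    similarities does not exceed the threshold t."""
    best_name, best_score = None, float("-inf")
    for name, feats in char_gallery.items():
        sims = sorted((cosine(track_feat, f) for f in feats), reverse=True)
        top = sims[:top_k]
        score = sum(top) / len(top)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > t else None

# Toy 2-D features standing in for real face embeddings.
gallery = {"ross": [[1.0, 0.0]] * 5, "rachel": [[0.0, 1.0]] * 5}
print(label_face_track([1.0, 0.0], gallery))   # "ross"
print(label_face_track([-1.0, 0.0], gallery))  # None: below threshold, discarded
```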