video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Authors: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvements on the video QA task and over 30% absolute accuracy improvements on audio-visual QA tasks with human speech. The results of video-SALMONN on the SAVE benchmark tasks are summarised in Table 2 and Table 3 for single-modal and audio-visual tasks respectively. This section particularly focuses on the key structural novelties, including the MRC Q-Former, the fine-grained synchronisation, and the training techniques of video-SALMONN, on selected SAVE benchmark tasks, as summarised in Table 4.
Researcher Affiliation | Collaboration | 1) Department of Electronic Engineering, Tsinghua University; 2) ByteDance Ltd. Correspondence to: Chao Zhang <cz277@tsinghua.edu.cn>.
Pseudocode | No | The paper describes the model structure and training approach in text and with diagrams (e.g., Figure 1, Figure 2, Figure 3), but no explicit pseudocode or algorithm block is provided.
Open Source Code | Yes | Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.
Open Datasets | Yes | Multi-task instruction fine-tuning is used to train the model parameters of the MRC Q-Former and LoRA in video-SALMONN. The training data contain both single-modal and audio-visual paired data. For audio-only tasks, the LibriSpeech train-clean-100 and train-clean-360 sets are used for ASR, and AudioCaps is used for AAC. For visual-only tasks, a mixture of LLaVA-150k image QA data (Liu et al., 2023), OCRVQA OCR data (Mishra et al., 2019), TextCaps image caption data (Sidorov et al., 2020), NExT-QA video QA training data (Xiao et al., 2021), 5,000 samples from COCO train2014 with spoken captions (Lin et al., 2014), and 11k samples from VideoChat (Li et al., 2023b) is used. For audio-visual tasks, randomly selected 600-hour Ego4D video captioning data (Grauman et al., 2022), the How2 300-hour training set for AVSR, and the audio-visual scene-aware dialogue (AVSD) training set are used. (A hedged sketch of this data mixture as a config object is given after the table.)
Dataset Splits | No | The paper mentions training and testing sets (e.g., 'LibriSpeech train-clean-100 and train-clean-360 sets' for training and 'LibriSpeech test-clean' for testing). It also mentions 'How2 dev5', a development set often used for validation, but a clear, reproducible breakdown of train/validation/test splits across all datasets, or a general methodology for creating validation sets, is not provided.
Hardware Specification | No | The paper discusses computational and storage costs but does not provide specific hardware details such as GPU/CPU models, memory specifications, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions specific model components such as 'Vicuna-v1.5', 'Whisper large-v2 encoder', 'BEATs encoder', and 'InstructBLIP vision Transformer', but does not list general software dependencies (e.g., Python, PyTorch, CUDA) with specific version numbers.
Experiment Setup | Yes | To validate video-SALMONN on the SAVE benchmark, the Vicuna-v1.5 (Chiang et al., 2023) models (7B and 13B, with 13B as the default unless otherwise specified) are used as the LLM, the Whisper (Radford et al., 2023) large-v2 encoder as the speech encoder, the BEATs (Chen et al., 2023d) encoder as the audio encoder, and the InstructBLIP (Dai et al., 2023) vision Transformer (ViT) plus Q-Former as the visual encoder. The visual encoder outputs 32 feature vectors for each video frame (one frame every 0.5 seconds), and the audio encoder outputs 50 feature vectors per second. The MRC Q-Former has two Transformer blocks with D=768-dim hidden states. By default, we adopt two resolution levels of 0.5 seconds and 5 seconds, with 3 and 30 output query vectors per window respectively. The output query vectors of the MRC Q-Former are projected to E=5120-dim before being sent to the LLM. The LLM is adapted using the low-rank adaptation (LoRA) (Hu et al., 2022) method with rank 32. (A hedged sketch restating these hyperparameters, with the implied token-rate arithmetic, follows the table.)
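
The data mixture quoted in the Open Datasets row can be summarised as a small configuration object. The sketch below is illustrative only: the TrainingDataMixture container and its field names are assumptions, not part of the released SALMONN code; it simply restates the datasets listed above.

```python
# Illustrative sketch (not from the released code) of the multi-task training mixture
# described in the Open Datasets row; container and field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class TrainingDataMixture:
    audio_only: list = field(default_factory=lambda: [
        ("LibriSpeech train-clean-100 + train-clean-360", "ASR"),
        ("AudioCaps", "AAC"),
    ])
    visual_only: list = field(default_factory=lambda: [
        ("LLaVA-150k", "image QA"),
        ("OCRVQA", "OCR"),
        ("TextCaps", "image captioning"),
        ("NExT-QA train", "video QA"),
        ("COCO train2014, 5,000 samples with spoken captions", "image captioning"),
        ("VideoChat, 11k samples", "video instructions"),
    ])
    audio_visual: list = field(default_factory=lambda: [
        ("Ego4D, 600 h randomly selected", "video captioning"),
        ("How2, 300 h training set", "AVSR"),
        ("AVSD training set", "audio-visual dialogue"),
    ])

mixture = TrainingDataMixture()
print(sum(len(v) for v in (mixture.audio_only, mixture.visual_only, mixture.audio_visual)), "data sources")
```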
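
The Experiment Setup row pins down most of the reported architectural hyperparameters, so they can be collected in one place. The sketch below is a minimal illustration, assuming a plain dataclass is an adequate stand-in for the real configuration; the field names and the query_vectors_per_second helper are assumptions, and the only derived quantity is the token-rate arithmetic implied by the quoted window sizes and query counts.

```python
# Hedged sketch of the reported video-SALMONN setup; field names are assumptions,
# the numbers are taken verbatim from the Experiment Setup row above.
from dataclasses import dataclass

@dataclass(frozen=True)
class VideoSALMONNSetup:
    llm: str = "Vicuna-v1.5-13B"                 # 7B also evaluated; 13B is the default
    speech_encoder: str = "Whisper large-v2"
    audio_encoder: str = "BEATs"
    visual_encoder: str = "InstructBLIP ViT + Q-Former"
    visual_tokens_per_frame: int = 32            # one video frame every 0.5 s
    audio_tokens_per_second: int = 50
    mrc_qformer_blocks: int = 2
    mrc_qformer_hidden_dim: int = 768            # D
    window_lengths_s: tuple = (0.5, 5.0)         # two resolution levels
    queries_per_window: tuple = (3, 30)
    llm_input_dim: int = 5120                    # E, projection before the LLM
    lora_rank: int = 32

    def query_vectors_per_second(self) -> float:
        # 3 queries / 0.5 s window + 30 queries / 5 s window = 6 + 6 = 12 per second.
        return sum(q / w for q, w in zip(self.queries_per_window, self.window_lengths_s))

setup = VideoSALMONNSetup()
print(setup.query_vectors_per_second())  # 12.0
```

Under the quoted numbers, the two-level MRC Q-Former therefore compresses roughly 64 visual plus 50 audio feature vectors per second into about 12 query vectors per second before they reach the LLM.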