Accommodating Audio Modality in CLIP for Multimodal Processing

Authors: Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin

AAAI 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning, and achieves the state-of-the-art performance on the benchmark datasets of MSR-VTT, VATEX, and Audiocaps. The corresponding code and checkpoints will be released at https://github.com/ludanruan/CLIP4VLA. [...] CLIP4VLA is demonstrated to be effective on both retrieval and captioning tasks, requiring much less hardware resource and training time. Our contributions can be summarized as follows: [...] Our model achieves the state-of-the-art performance in retrieval and captioning tasks on the benchmark datasets of MSR-VTT, VATEX, and Audiocaps. |
| Researcher Affiliation | Academia | Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin*, School of Information, Renmin University of China, {ruanld,anwenhu,syuqing,zhangliang00,zhengsipeng,qjin}@ruc.edu.cn |
| Pseudocode | No | The paper describes the model structure and training process in text and diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The corresponding code and checkpoints will be released at https://github.com/ludanruan/CLIP4VLA. |
| Open Datasets | Yes | Our pre-training data includes instructional video dataset Howto100M (Miech et al. 2019) and event video dataset Audioset (Gemmeke et al. 2017). [...] We evaluate the pre-trained CLIP4VLA on retrieval and captioning benchmarks, including MSR-VTT (Xu et al. 2016), VATEX (Wang et al. 2019) and Audiocaps (Kim et al. 2019). [...] For training cost comparison with previous work, we further measure our model on event classification datasets of UCF101 (Soomro, Zamir, and Shah 2012) and ESC50 (Piczak 2015). |
| Dataset Splits | Yes | After filtering out the silent videos, MSR-VTT remains 7867 and 884 videos for training and testing on the retrieval task, and 5867, 448, and 2617 videos for training, validation, and testing on the captioning task. VATEX remains 24667, 1427, and 1421 videos for training, validation, and testing on retrieval, and 24667, 2845, and 5698 videos for training, validation, and testing on captioning. Audiocaps keeps 49712, 495, and 967 videos for training, validation, and testing on the retrieval task. |
| Hardware Specification | Yes | The training-cost comparison table reports 48 V100 days of training for CLIP4VLA (cited row: CLIP4VLA, 48 V100 days, 256, 88M, 86.8, 91.9). |
| Software Dependencies | No | The paper does not specify any software dependencies (e.g., programming languages, libraries, frameworks) with version numbers. |
| Experiment Setup | No | The paper describes general experimental settings such as the pre-training datasets and audio feature extraction parameters (e.g., a 224-dimensional log Mel filterbank with a 32 ms Hamming window every 8 ms). However, the main text does not state specific hyperparameters such as the learning rate, batch size, number of epochs, or optimizer; some details are deferred to supplementary material, and only details stated explicitly in the main text are counted here. |
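
For reference, the audio front end cited above (a 224-dimensional log Mel filterbank computed with a 32 ms Hamming window every 8 ms) can be approximated with standard tooling. The sketch below uses torchaudio's Kaldi-compatible fbank; the input file, the 16 kHz mono assumption, and the parameter mapping are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (not the authors' implementation): approximate the described
# audio features -- a 224-dimensional log Mel filterbank with a 32 ms Hamming
# window and an 8 ms hop. Assumes mono 16 kHz audio; the file path is hypothetical.
import torchaudio

waveform, sample_rate = torchaudio.load("example.wav")  # hypothetical input file
if waveform.size(0) > 1:                                # downmix to mono if needed
    waveform = waveform.mean(dim=0, keepdim=True)

log_mel = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=224,        # 224-dimensional log Mel filterbank
    frame_length=32.0,       # 32 ms analysis window
    frame_shift=8.0,         # 8 ms hop between frames
    window_type="hamming",   # Hamming window
)
# log_mel has shape (num_frames, 224); the log is applied by default (use_log_fbank=True).
print(log_mel.shape)
```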