XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning

Authors: Pritam Sarkar, Ali Etemad

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed method on a variety of downstream tasks including video action recognition, sound classification, and multimodal action classification. We use a total of 6 datasets for downstream tasks, namely Kinetics400 (K400) (Kay et al. 2017), Kinetics-Sound (KS) (Arandjelovic and Zisserman 2017), UCF101 (U101) (Soomro, Zamir, and Shah 2012), HMDB51 (H51) (Kuehne et al. 2011), ESC50 (E50) (Piczak 2015), and FSD50K (F50K) (Fonseca et al. 2022).
Researcher Affiliation | Academia | Pritam Sarkar (1, 2), Ali Etemad (1); (1) Queen's University, Canada; (2) Vector Institute. {pritam.sarkar, ali.etemad}@queensu.ca
Pseudocode | Yes | Algorithm 1: XKD Training Algorithm. (A hedged sketch of how the training loss terms might be combined appears after this table.)
Open Source Code | Yes | The code, pretrained models, and supplementary material are available on the project website: https://pritamqu.github.io/XKD
Open Datasets | Yes | We pretrain XKD on 3 datasets of different sizes including the small-scale Kinetics-Sound (Arandjelovic and Zisserman 2017), large-scale Kinetics400 (Kay et al. 2017), and very large-scale Audio Set (Gemmeke et al. 2017). We evaluate our proposed method on a variety of downstream tasks including video action recognition, sound classification, and multimodal action classification. We use a total of 6 datasets for downstream tasks, namely Kinetics400 (K400) (Kay et al. 2017), Kinetics-Sound (KS) (Arandjelovic and Zisserman 2017), UCF101 (U101) (Soomro, Zamir, and Shah 2012), HMDB51 (H51) (Kuehne et al. 2011), ESC50 (E50) (Piczak 2015), and FSD50K (F50K) (Fonseca et al. 2022).
Dataset Splits | No | We redirect readers to see the details on evaluation protocols in the Suppl. Mat. Sec. A.5 and A.6. The main text refers to a "linear evaluation and finetuning setup" and "split-1" but does not provide explicit percentages or counts for train/validation/test splits within the main paper.
Hardware Specification | No | The acknowledgments thank the "SciNet HPC Consortium for helping with the computation resources", but the paper does not specify particular hardware such as GPU models, CPU types, or cloud instances used for the experiments.
Software Dependencies | No | The paper refers to architectural choices such as ViT but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | During pretraining, to save computation we downsample the video input at 8 FPS and resize the frame resolution to 112x112. Additionally, we re-sample the audio waveforms at 16 kHz and generate mel-spectrograms using 80 mel filters. We use a patch size of 4x16 for audio spectrograms and a cuboid size of 4x16x16 for video input. Empirically, we set λae, λda, and λkd as 5, 1, and 1 respectively. (A sketch of this preprocessing appears immediately after this table.)
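
The Experiment Setup row fixes the concrete input shapes (8 FPS video at 112x112, 16 kHz audio, 80 mel filters, 4x16 audio patches, 4x16x16 video cuboids). The sketch below illustrates such a pipeline; it is not the authors' code (which is linked from the project website), and the n_fft/hop_length values, the patch orientation, and both patchify helpers are illustrative assumptions.

```python
# A rough sketch of the input pipeline described in the Experiment Setup row.
# The authors' actual code is linked from https://pritamqu.github.io/XKD; the
# n_fft/hop_length values and both patchify helpers are illustrative assumptions.
import torch
import torchaudio

# audio: 16 kHz waveform -> mel-spectrogram with 80 mel filters (paper values)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000,  # paper: waveforms re-sampled at 16 kHz
    n_mels=80,           # paper: 80 mel filters
    n_fft=1024,          # assumed, not stated in the excerpt
    hop_length=256,      # assumed, not stated in the excerpt
)
spec = mel(torch.randn(1, 16_000 * 2))   # dummy 2-second clip -> (1, 80, T)


def patchify_audio(spec, patch=(4, 16)):
    """Split a (1, 80, T) spectrogram into non-overlapping 4x16 patches.
    Which axis takes 4 and which takes 16 is an assumption."""
    _, f, t = spec.shape
    spec = spec[:, : f - f % patch[0], : t - t % patch[1]]   # trim to multiples
    tiles = spec.unfold(1, patch[0], patch[0]).unfold(2, patch[1], patch[1])
    return tiles.reshape(-1, patch[0] * patch[1])            # one row per patch


def patchify_video(clip, cuboid=(4, 16, 16)):
    """Split a (T, C, 112, 112) clip into non-overlapping 4x16x16 cuboids."""
    t, c, h, w = clip.shape
    ct, ch, cw = cuboid
    tubes = (clip[: t - t % ct]
             .unfold(0, ct, ct)      # time
             .unfold(2, ch, ch)      # height
             .unfold(3, cw, cw))     # width -> (T', C, H', W', ct, ch, cw)
    tubes = tubes.permute(0, 2, 3, 1, 4, 5, 6)               # group per cuboid
    return tubes.reshape(-1, c * ct * ch * cw)               # one row per cuboid


audio_tokens = patchify_audio(spec)                           # (N_audio, 4*16)
video_tokens = patchify_video(torch.randn(16, 3, 112, 112))   # (196, 3*4*16*16)
```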
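
The Pseudocode row confirms the paper provides Algorithm 1 (the XKD training algorithm), but the algorithm itself is not reproduced in this report. Based only on the paper title (cross-modal knowledge distillation with domain alignment) and the loss weights quoted in the Experiment Setup row (λae = 5, λda = 1, λkd = 1), a plausible reading is a total loss combining reconstruction, domain-alignment, and cross-modal distillation terms. The sketch below is purely illustrative: the loss forms, feature shapes, and teacher/student wiring are assumptions, not the authors' Algorithm 1.

```python
# Illustrative only: one way the three weighted terms named in the paper
# (lambda_ae = 5, lambda_da = 1, lambda_kd = 1) could be combined. The loss
# forms and the teacher/student wiring are assumptions, not Algorithm 1.
import torch
import torch.nn.functional as F

LAMBDA_AE, LAMBDA_DA, LAMBDA_KD = 5.0, 1.0, 1.0  # weights quoted from the paper


def xkd_style_loss(video_recon, video_target, audio_recon, audio_target,
                   video_feat, audio_feat, video_teacher, audio_teacher):
    """Toy combination of reconstruction, alignment, and distillation terms."""
    # masked-reconstruction (autoencoding) term, one per modality (assumed MSE)
    loss_ae = (F.mse_loss(video_recon, video_target)
               + F.mse_loss(audio_recon, audio_target))

    # domain-alignment term: a simple feature-statistics matching stand-in
    loss_da = (F.mse_loss(video_feat.mean(dim=0), audio_feat.mean(dim=0))
               + F.mse_loss(video_feat.var(dim=0), audio_feat.var(dim=0)))

    # cross-modal knowledge distillation: each modality's student features
    # regress the other modality's (detached) teacher features
    loss_kd = (F.mse_loss(video_feat, audio_teacher.detach())
               + F.mse_loss(audio_feat, video_teacher.detach()))

    return LAMBDA_AE * loss_ae + LAMBDA_DA * loss_da + LAMBDA_KD * loss_kd


if __name__ == "__main__":
    feats = [torch.randn(8, 256) for _ in range(4)]      # dummy feature tensors
    recons = [torch.randn(8, 3072) for _ in range(4)]    # dummy reconstructions
    print(xkd_style_loss(*recons, *feats).item())
```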