XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Authors: Pritam Sarkar, Ali Etemad
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed method on a variety of downstream tasks including video action recognition, sound classification, and multimodal action classification. We use a total of 6 datasets for downstream tasks, namely Kinetics400 (K400) (Kay et al. 2017), Kinetics-Sound (KS) (Arandjelovic and Zisserman 2017), UCF101 (U101) (Soomro, Zamir, and Shah 2012), HMDB51 (H51) (Kuehne et al. 2011), ESC50 (E50) (Piczak 2015), and FSD50K (F50K) (Fonseca et al. 2022). |
| Researcher Affiliation | Academia | Pritam Sarkar¹,², Ali Etemad¹. ¹Queen's University, Canada; ²Vector Institute. {pritam.sarkar, ali.etemad}@queensu.ca |
| Pseudocode | Yes | Algorithm 1: XKD Training Algorithm. |
| Open Source Code | Yes | The code, pretrained models, and supplementary material are available on the project website1. 1https://pritamqu.github.io/XKD |
| Open Datasets | Yes | We pretrain XKD on 3 datasets of different sizes including the small-scale Kinetics-Sound (Arandjelovic and Zisserman 2017), large-scale Kinetics400 (Kay et al. 2017), and very large-scale Audio Set (Gemmeke et al. 2017). We evaluate our proposed method on a variety of downstream tasks including video action recognition, sound classification, and multimodal action classification. We use a total of 6 datasets for downstream tasks, namely Kinetics400 (K400) (Kay et al. 2017), Kinetics-Sound (KS) (Arandjelovic and Zisserman 2017), UCF101 (U101) (Soomro, Zamir, and Shah 2012), HMDB51 (H51) (Kuehne et al. 2011), ESC50 (E50) (Piczak 2015), and FSD50K (F50K) (Fonseca et al. 2022). |
| Dataset Splits | No | We redirect readers to see the details on evaluation protocols in the Suppl. Mat. Sec. A.5 and A.6. The main text refers to "linear evaluation and finetuning setup" and "split-1" but does not provide explicit percentages or counts for train/validation/test splits within the main paper. |
| Hardware Specification | No | The paper mentions using the "SciNet HPC Consortium for helping with the computation resources" in the acknowledgments but does not specify any particular hardware such as GPU models, CPU types, or cloud instances used for running experiments. |
| Software Dependencies | No | The paper refers to architectural choices like ViT but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | During pretraining, to save computation we downsample the video input at 8 FPS and resize the frame resolution to 112×112. Additionally, we re-sample the audio waveforms at 16 kHz and generate mel-spectrograms using 80 mel filters. We use a patch size of 4×16 for audio spectrograms and a cuboid size of 4×16×16 for video input. Empirically, we set λ_ae, λ_da, and λ_kd as 5, 1, and 1 respectively. |
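
The quoted setup fixes the input pipeline (8 FPS video at 112×112, 16 kHz audio with 80-mel spectrograms) and the loss weights λ_ae = 5, λ_da = 1, λ_kd = 1. Below is a minimal sketch of that preprocessing and loss weighting in PyTorch/torchaudio, assuming these quoted values; every parameter not quoted above (FFT size, hop length, interpolation mode) and all helper names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torchaudio

TARGET_SR = 16_000   # audio resampled to 16 kHz (quoted)
N_MELS = 80          # 80 mel filters (quoted)
FRAME_SIZE = 112     # frames resized to 112 x 112 (quoted)

def preprocess_audio(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """Resample to 16 kHz and convert to a log-mel spectrogram with 80 mel bins."""
    if orig_sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, orig_sr, TARGET_SR)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=TARGET_SR,
        n_mels=N_MELS,
        n_fft=1024,       # assumption: not specified in the excerpt
        hop_length=160,   # assumption: 10 ms hop at 16 kHz
    )(waveform)
    return torch.log(mel + 1e-6)

def preprocess_video(frames: torch.Tensor) -> torch.Tensor:
    """Resize frames (T, C, H, W) to 112 x 112; temporal subsampling to 8 FPS
    is assumed to happen earlier, at decode time."""
    return torch.nn.functional.interpolate(
        frames, size=(FRAME_SIZE, FRAME_SIZE), mode="bilinear", align_corners=False
    )

def total_loss(l_ae: torch.Tensor, l_da: torch.Tensor, l_kd: torch.Tensor,
               lam_ae: float = 5.0, lam_da: float = 1.0, lam_kd: float = 1.0) -> torch.Tensor:
    """Weighted sum of the reconstruction, domain-alignment, and distillation terms,
    using the quoted weights; the individual terms are placeholders for the paper's objectives."""
    return lam_ae * l_ae + lam_da * l_da + lam_kd * l_kd
```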