XKD: Cross-Modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Authors: Pritam Sarkar, Ali Etemad
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed method on a variety of downstream tasks including video action recognition, sound classification, and multimodal action classification. We use a total of 6 datasets for downstream tasks, namely Kinetics400 (K400) (Kay et al. 2017), Kinetics-Sound (KS) (Arandjelovic and Zisserman 2017), UCF101 (U101) (Soomro, Zamir, and Shah 2012), HMDB51 (H51) (Kuehne et al. 2011), ESC50 (E50) (Piczak 2015), and FSD50K (F50K) (Fonseca et al. 2022). |
| Researcher Affiliation | Academia | Pritam Sarkar¹,², Ali Etemad¹. ¹Queen's University, Canada; ²Vector Institute. {pritam.sarkar, ali.etemad}@queensu.ca |
| Pseudocode | Yes | Algorithm 1: XKD Training Algorithm. |
| Open Source Code | Yes | The code, pretrained models, and supplementary material are available on the project website1. 1https://pritamqu.github.io/XKD |
| Open Datasets | Yes | We pretrain XKD on 3 datasets of different sizes including the small-scale Kinetics-Sound (Arandjelovic and Zisserman 2017), large-scale Kinetics400 (Kay et al. 2017), and very large-scale Audio Set (Gemmeke et al. 2017). We evaluate our proposed method on a variety of downstream tasks including video action recognition, sound classification, and multimodal action classification. We use a total of 6 datasets for downstream tasks, namely Kinetics400 (K400) (Kay et al. 2017), Kinetics-Sound (KS) (Arandjelovic and Zisserman 2017), UCF101 (U101) (Soomro, Zamir, and Shah 2012), HMDB51 (H51) (Kuehne et al. 2011), ESC50 (E50) (Piczak 2015), and FSD50K (F50K) (Fonseca et al. 2022). |
| Dataset Splits | No | We redirect readers to see the details on evaluation protocols in the Suppl. Mat. Sec. A.5 and A.6. The main text refers to "linear evaluation and finetuning setup" and "split-1" but does not provide explicit percentages or counts for train/validation/test splits within the main paper. |
| Hardware Specification | No | The paper mentions using the "SciNet HPC Consortium for helping with the computation resources" in the acknowledgments but does not specify any particular hardware such as GPU models, CPU types, or cloud instances used for running experiments. |
| Software Dependencies | No | The paper refers to architectural choices like ViT but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | During pretraining, to save computation we downsample the video input at 8 FPS and resize the frame resolution to 112×112. Additionally, we re-sample the audio waveforms at 16 kHz and generate mel-spectrograms using 80 mel filters. We use a patch size of 4×16 for audio spectrograms and a cuboid size of 4×16×16 for video input. Empirically, we set λ_ae, λ_da, and λ_kd as 5, 1, and 1 respectively. |
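
The quoted setup fixes the input pipeline (8 FPS video at 112×112, 16 kHz audio with 80-mel spectrograms) and the loss weights λ_ae = 5, λ_da = 1, λ_kd = 1. Below is a minimal sketch of that preprocessing and loss weighting in PyTorch/torchaudio, assuming these quoted values; every parameter not quoted above (FFT size, hop length, interpolation mode) and all helper names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torchaudio

TARGET_SR = 16_000   # audio resampled to 16 kHz (quoted)
N_MELS = 80          # 80 mel filters (quoted)
FRAME_SIZE = 112     # frames resized to 112 x 112 (quoted)

def preprocess_audio(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """Resample to 16 kHz and convert to a log-mel spectrogram with 80 mel bins."""
    if orig_sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, orig_sr, TARGET_SR)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=TARGET_SR,
        n_mels=N_MELS,
        n_fft=1024,       # assumption: not specified in the excerpt
        hop_length=160,   # assumption: 10 ms hop at 16 kHz
    )(waveform)
    return torch.log(mel + 1e-6)

def preprocess_video(frames: torch.Tensor) -> torch.Tensor:
    """Resize frames (T, C, H, W) to 112 x 112; temporal subsampling to 8 FPS
    is assumed to happen earlier, at decode time."""
    return torch.nn.functional.interpolate(
        frames, size=(FRAME_SIZE, FRAME_SIZE), mode="bilinear", align_corners=False
    )

def total_loss(l_ae: torch.Tensor, l_da: torch.Tensor, l_kd: torch.Tensor,
               lam_ae: float = 5.0, lam_da: float = 1.0, lam_kd: float = 1.0) -> torch.Tensor:
    """Weighted sum of the reconstruction, domain-alignment, and distillation terms,
    using the quoted weights; the individual terms are placeholders for the paper's objectives."""
    return lam_ae * l_ae + lam_da * l_da + lam_kd * l_kd
```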