The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation
Authors: Zihui Xue, Zhengqi Gao, Sucheng Ren, Hang Zhao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate crossmodal KD on a few multimodal tasks and find surprisingly that teacher performance does not always positively correlate with student performance. To explore the cause of performance mismatch in crossmodal KD, we adopt the modality Venn diagram (MVD) to understand modality relationships and formally define modality-general decisive features and modality-specific decisive features. We present the modality focusing hypothesis (MFH) that provides an explanation of when crossmodal KD is effective. We hypothesize that modality-general decisive features are the crucial factor that determines the efficacy of crossmodal KD. We conduct experiments on 6 multimodal datasets (i.e., synthetic Gaussian, AV-MNIST, RAVDESS, VGGSound, NYU Depth V2, and MM-IMDB). The results validate the proposed MFH and provide insights on how to improve crossmodal KD. |
| Researcher Affiliation | Academia | The University of Texas at Austin; Massachusetts Institute of Technology; South China University of Technology; Tsinghua University; Shanghai Qi Zhi Institute |
| Pseudocode | Yes | Algorithm 1 Modality-General Decisive Feature Ranking |
| Open Source Code | Yes | Our code is available at https://github.com/zihuixue/MFH. |
| Open Datasets | Yes | We conduct experiments on 6 multimodal datasets (synthetic Gaussian, AV-MNIST, RAVDESS, VGGSound, NYU Depth V2, and MM-IMDB)... AV-MNIST (Vielzeuf et al., 2018)... NYU Depth V2 (Nathan Silberman & Fergus, 2012)... RAVDESS (Livingstone & Russo, 2018)... VGGSound (Chen et al., 2020a)... MM-IMDB (Arevalo et al., 2017). |
| Dataset Splits | Yes | AV-MNIST... There are 50,000 pairs for training, 5,000 pairs for validation and 10,000 pairs for testing. ... NYU Depth V2... 795 images are used for training and 654 images are for testing. ... RAVDESS... 7,943 data for training, 2,364 data for validation and 1,001 data for testing. ... MM-IMDB... There are 15,552 data for training, 2,608 for validation, and 7,799 for testing. ... VGGSound... 56,614 audio-video pairs for training and 4,501 audio-video pairs for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to conduct the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x) that would be needed for reproducibility. |
| Experiment Setup | Yes | Recall that ρ in Eq. (1) in the main text controls the relative importance of the two loss terms when training the student network. We experiment with both ρ = 0 and ρ = 0.5, and repeat the experiments 10 times. ... ρ is set to 0.5. ... We set ρ in Eq. (1) in the main text as 0 (i.e., only use Lkd for distillation) to fully observe the teacher's influence on student performance. |
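The Experiment Setup row describes a student loss with two terms weighted by ρ, where ρ = 0 uses only the distillation term. The paper's exact Eq. (1) is not reproduced here, so the sketch below assumes the common KD convention L = ρ·L_task + (1 − ρ)·L_kd (a cross-entropy task loss plus a temperature-softened KL distillation loss), which is consistent with the quoted note that ρ = 0 trains on L_kd alone. Function names and the temperature parameter are illustrative, not taken from the paper's code.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax over the last axis, with temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, rho=0.5, T=1.0):
    """Assumed form of the student objective: rho * L_task + (1 - rho) * L_kd.

    rho = 0   -> pure distillation (only the KD term, as in the quoted setup).
    rho = 0.5 -> equal weight on the task loss and the KD term.
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    # Task loss: cross-entropy between student predictions and ground truth.
    p_student = softmax(student_logits)
    task_loss = -np.mean(np.log(p_student[np.arange(n), labels]))
    # KD loss: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as is conventional so gradients are comparable across T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd_loss = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)) * T**2
    return rho * task_loss + (1.0 - rho) * kd_loss
```

With ρ = 0 and identical teacher and student logits, the KL term vanishes and the loss is zero, which is a quick sanity check that the weighting matches the quoted behavior.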