The Modality Focusing Hypothesis: Towards Understanding Crossmodal Knowledge Distillation
Authors: Zihui Xue, Zhengqi Gao, Sucheng Ren, Hang Zhao
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate crossmodal KD on a few multimodal tasks and find surprisingly that teacher performance does not always positively correlate with student performance. To explore the cause of performance mismatch in crossmodal KD, we adopt the modality Venn diagram (MVD) to understand modality relationships and formally define modality-general decisive features and modality-specific decisive features. We present the modality focusing hypothesis (MFH) that provides an explanation of when crossmodal KD is effective. We hypothesize that modality-general decisive features are the crucial factor that determines the efficacy of crossmodal KD. We conduct experiments on 6 multimodal datasets (i.e., synthetic Gaussian, AV-MNIST, RAVDESS, VGGSound, NYU Depth V2, and MM-IMDB). The results validate the proposed MFH and provide insights on how to improve crossmodal KD. |
| Researcher Affiliation | Academia | The University of Texas at Austin; Massachusetts Institute of Technology; South China University of Technology; Tsinghua University; Shanghai Qi Zhi Institute |
| Pseudocode | Yes | Algorithm 1 Modality-General Decisive Feature Ranking |
| Open Source Code | Yes | Our code is available at https://github.com/zihuixue/MFH. |
| Open Datasets | Yes | We conduct experiments on 6 multimodal datasets (synthetic Gaussian, AV-MNIST, RAVDESS, VGGSound, NYU Depth V2, and MM-IMDB)... AV-MNIST (Vielzeuf et al., 2018)... NYU Depth V2 (Nathan Silberman & Fergus, 2012)... RAVDESS (Livingstone & Russo, 2018)... VGGSound (Chen et al., 2020a)... MM-IMDB (Arevalo et al., 2017). |
| Dataset Splits | Yes | AV-MNIST... There are 50,000 pairs for training, 5,000 pairs for validation and 10,000 pairs for testing. ... NYU Depth V2... 795 images are used for training and 654 images are for testing. ... RAVDESS... 7,943 data for training, 2,364 data for validation and 1,001 data for testing. ... MM-IMDB... There are 15,552 data for training, 2,608 for validation, and 7,799 for testing. ... VGGSound... 56,614 audio-video pairs for training and 4,501 audio-video pairs for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to conduct the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x) that would be needed for reproducibility. |
| Experiment Setup | Yes | Recall that ρ in Eq. (1) in the main text controls the relative importance of the two loss terms when training the student network. We experiment with both ρ = 0 and ρ = 0.5, and repeat the experiments 10 times. ... ρ is set to 0.5. ... We set ρ in Eq. (1) in the main text as 0 (i.e., only use Lkd for distillation) to fully observe the teacher's influence on student performance. |
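The Experiment Setup row describes a student loss with two terms weighted by ρ, where ρ = 0 uses only the distillation term. The paper's exact Eq. (1) is not reproduced here, so the sketch below assumes the common KD convention L = ρ·L_task + (1 − ρ)·L_kd (a cross-entropy task loss plus a temperature-softened KL distillation loss), which is consistent with the quoted note that ρ = 0 trains on L_kd alone. Function names and the temperature parameter are illustrative, not taken from the paper's code.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax over the last axis, with temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, rho=0.5, T=1.0):
    """Assumed form of the student objective: rho * L_task + (1 - rho) * L_kd.

    rho = 0   -> pure distillation (only the KD term, as in the quoted setup).
    rho = 0.5 -> equal weight on the task loss and the KD term.
    """
    labels = np.asarray(labels)
    n = labels.shape[0]
    # Task loss: cross-entropy between student predictions and ground truth.
    p_student = softmax(student_logits)
    task_loss = -np.mean(np.log(p_student[np.arange(n), labels]))
    # KD loss: KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as is conventional so gradients are comparable across T.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd_loss = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)) * T**2
    return rho * task_loss + (1.0 - rho) * kd_loss
```

With ρ = 0 and identical teacher and student logits, the KL term vanishes and the loss is zero, which is a quick sanity check that the weighting matches the quoted behavior.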