On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

Authors: Chenzhuang Du, Jiaye Teng, Tingle Li, Yichen Liu, Tianyuan Yuan, Yue Wang, Yang Yuan, Hang Zhao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets, including VGG-Sound, Kinetics-400, UCF101, and ModelNet40.
Researcher Affiliation | Collaboration | IIIS, Tsinghua University; University of California, Berkeley; Massachusetts Institute of Technology; Shanghai Artificial Intelligence Laboratory; Shanghai Qi Zhi Institute.
Pseudocode | Yes | Algorithm 1: Uni-Modal Teacher (UMT). A hedged sketch of the distillation idea follows the table.
Open Source Code | Yes | We also provide our code in the supplementary material in case of ambiguity.
Open Datasets | Yes | Kinetics-400 dataset (Kay et al., 2017) contains over 240k videos for training and 19k for validation, which we download from cvdfoundation. VGG-Sound dataset (Chen et al., 2020b), which contains over 200k video clips for 309 different sound classes, is also used for evaluating our method. UCF101 dataset (Soomro et al., 2012) is an action recognition dataset with 101 action categories, including 7k videos for training and 3k for testing. ModelNet40 is a 3D object classification task with 9,483 training samples and 2,468 test samples. Following Wu et al. (2022), we treat the front and rear views as two modalities.
Dataset Splits | Yes | Kinetics-400 dataset (Kay et al., 2017) contains over 240k videos for training and 19k for validation... UCF101 dataset (Soomro et al., 2012) is an action recognition dataset with 101 action categories, including 7k videos for training and 3k for testing.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for running the experiments. It only mentions general setups like 'ResNet3D (Video), 2D (Audio)' for the encoders and training parameters, without hardware details.
Software Dependencies | No | The paper mentions software tools such as PySlowFast, mmaction2, and nnAudio (Cheuk et al., 2020), but it does not specify version numbers for these tools or for general programming languages/libraries such as Python, PyTorch, or CUDA. An illustrative nnAudio snippet follows the table.
Experiment Setup | Yes | We show the hyperparameters of our experiments on UCF101 and VGG-Sound in Table 11. For the Kinetics-400 RGB modality, we follow the hyperparameters and settings of PySlowFast; for the audio modality, we modify the hyperparameters to be as consistent as possible with the RGB training for subsequent joint training, specifically using the same learning rate and batch size as the RGB training. For ModelNet40, we follow the experimental settings of Wu et al. (2022).
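
The Pseudocode row cites Algorithm 1, Uni-Modal Teacher (UMT), the "simple guiding strategy" quoted in the Research Type row: pre-trained uni-modal encoders act as teachers whose features guide the corresponding encoders during multi-modal joint training. The PyTorch sketch below is only a rough illustration of that idea, not the paper's implementation; the class and function names (LateFusionModel, umt_step), the use of MSE as the feature-distillation loss, and the weights lambda_v and lambda_a are assumptions, and the exact procedure and hyperparameters are given by Algorithm 1, Table 11, and the released code.

```python
# Hedged sketch of a UMT-style training step. Assumptions (not from the
# paper): encoder classes, feature dimensions, loss weights, and MSE as
# the feature-distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionModel(nn.Module):
    """Multi-modal student: one encoder per modality plus a fused classifier."""

    def __init__(self, video_encoder, audio_encoder, feat_dim, num_classes):
        super().__init__()
        self.video_encoder = video_encoder   # e.g. a 3D ResNet backbone
        self.audio_encoder = audio_encoder   # e.g. a 2D ResNet on spectrograms
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, video, audio):
        fv = self.video_encoder(video)       # (B, feat_dim)
        fa = self.audio_encoder(audio)       # (B, feat_dim)
        logits = self.classifier(torch.cat([fv, fa], dim=1))
        return logits, fv, fa


def umt_step(model, video_teacher, audio_teacher, batch,
             lambda_v=1.0, lambda_a=1.0):
    """One training step: task loss plus per-modality feature distillation.

    `video_teacher` and `audio_teacher` are frozen encoders pre-trained on
    their single modality; their features guide the student's encoders.
    """
    video, audio, labels = batch
    logits, fv, fa = model(video, audio)
    task_loss = F.cross_entropy(logits, labels)

    with torch.no_grad():                    # teachers stay frozen
        tv = video_teacher(video)
        ta = audio_teacher(audio)

    distill_loss = lambda_v * F.mse_loss(fv, tv) + lambda_a * F.mse_loss(fa, ta)
    return task_loss + distill_loss
```

The loss weights and distillation targets above are placeholders; the point of the sketch is only that each uni-modal teacher supervises its matching encoder inside the jointly trained fusion model.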
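
The Software Dependencies row notes that nnAudio is used for audio preprocessing but without a version number. Purely as an illustration of that dependency, the snippet below computes log-mel features on a batch of waveforms; the sample rate, FFT size, and mel-bin count are placeholder values rather than the paper's settings, and the module path depends on the installed version (newer releases expose the layers under `nnAudio.features`, older ones under `nnAudio.Spectrogram`).

```python
# Illustrative nnAudio usage only; parameter values are placeholders,
# not the paper's settings.
import torch
from nnAudio import features

# Differentiable mel-spectrogram layer (runs on GPU if moved there).
mel = features.MelSpectrogram(sr=16000, n_fft=1024, n_mels=128, hop_length=256)

waveform = torch.randn(4, 16000)      # (batch, samples): four 1-second clips
spec = mel(waveform)                  # (batch, n_mels, time_frames)
log_spec = torch.log(spec + 1e-6)     # log-mel features for the 2D audio encoder
```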