Cross-modal Representation Flattening for Multi-modal Domain Generalization

Authors: Yunfeng Fan, Wenchao Xu, Haozhao Wang, Song Guo

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are performed on two benchmark datasets, EPIC-Kitchens and Human-Animal-Cartoon (HAC), with various modality combinations, demonstrating the effectiveness of our method under multi-source and single-source settings. |
| Researcher Affiliation | Academia | Yunfeng Fan¹, Wenchao Xu¹, Haozhao Wang², Song Guo³. ¹Department of Computing, The Hong Kong Polytechnic University; ²School of Computer Science and Technology, Huazhong University of Science and Technology; ³Hong Kong University of Science and Technology |
| Pseudocode | No | The paper does not contain a clearly labeled "Pseudocode" or "Algorithm" block, nor structured steps formatted like code. |
| Open Source Code | Yes | Our code is open-sourced at https://github.com/fanyunfeng-bit/Cross-modal-Representation-Flattening-for-MMDG |
| Open Datasets | Yes | We utilize two benchmark datasets, EPIC-Kitchens [40] and Human-Animal-Cartoon (HAC) [28]. |
| Dataset Splits | Yes | For all methods, we follow [41] and select the model with the best validation (in-domain) accuracy to evaluate generalization on test (out-of-domain) data. |
| Hardware Specification | Yes | All experiments were conducted on an NVIDIA GeForce RTX 3090 GPU with a 3.9 GHz Intel Core i9-12900K CPU. |
| Software Dependencies | No | The paper mentions the MMAction2 toolkit [44] and the Adam optimizer [49] but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | The dimensions of the uni-modal feature h are 2304 for video, 512 for audio, and 2048 for optical flow. For the projector Proj_k(·), we implement a multi-layer perceptron with two hidden layers of size 2048 and output size 128. We use the Adam optimizer [49] with a learning rate of 0.0001 and a batch size of 16. The scalar temperature parameter τ is set to 0.1. Additionally, we set λ1 = 2.0, λ2 = λ3 = 3.0, α in the Beta distribution to 0.1, and the SMA start iteration t0 to 400 for EPIC-Kitchens and 100 for HAC respectively. The model is trained for 15 epochs, taking about two hours (see the sketches below). |
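
The projector and optimizer details in the Experiment Setup row translate into a short PyTorch sketch (MMAction2, cited by the paper, is PyTorch-based). This is a minimal illustration only: the ReLU activations, the function and variable names, and the placement of the modules are assumptions, since the row specifies layer sizes and hyperparameters but not the implementation.

```python
import torch
import torch.nn as nn

# Per-modality projector Proj_k: an MLP with two hidden layers of size 2048
# and output size 128, as stated in the Experiment Setup row. The ReLU
# activations are an assumption; the excerpt only gives layer sizes.
def make_projector(in_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(in_dim, 2048),
        nn.ReLU(inplace=True),
        nn.Linear(2048, 2048),
        nn.ReLU(inplace=True),
        nn.Linear(2048, 128),
    )

# Uni-modal feature dimensions quoted in the setup:
# video 2304, audio 512, optical flow 2048.
projectors = nn.ModuleDict({
    "video": make_projector(2304),
    "audio": make_projector(512),
    "flow": make_projector(2048),
})

# Optimizer and temperature as reported: Adam with lr 1e-4 (batch size 16),
# scalar temperature tau = 0.1 for the contrastive objective.
optimizer = torch.optim.Adam(projectors.parameters(), lr=1e-4)
tau = 0.1
```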
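
The same row reports an "SMA start iteration t0" (400 for EPIC-Kitchens, 100 for HAC), which suggests a weight-averaging schedule in the spirit of flat-minima methods. The sketch below assumes SMA denotes a simple moving average of model parameters that begins at iteration t0; that reading, and the helper `sma_update`, are assumptions rather than details confirmed by the excerpt.

```python
import torch

@torch.no_grad()
def sma_update(avg_model, model, step, t0=400):
    """Hypothetical simple-moving-average update starting at iteration t0
    (t0 = 400 for EPIC-Kitchens, 100 for HAC per the setup row)."""
    if step <= t0:
        # Before averaging begins, the averaged model tracks the live weights.
        avg_model.load_state_dict(model.state_dict())
        return
    n = step - t0 + 1  # number of iterates averaged so far
    for p_avg, p in zip(avg_model.parameters(), model.parameters()):
        # Running mean: p_avg <- ((n - 1) * p_avg + p) / n
        p_avg.mul_((n - 1) / n).add_(p, alpha=1.0 / n)
```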