Learning Unseen Modality Interaction

Authors: Yunhua Zhang, Hazel Doughty, Cees Snoek

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. To evaluate this new problem, we reorganize existing datasets and tasks for video classification, robotic state regression, and multimedia retrieval on a variety of modality combinations.
Researcher Affiliation | Academia | Yunhua Zhang, Hazel Doughty, Cees G.M. Snoek, University of Amsterdam
Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/.
Open Datasets | Yes | Since unseen modality interaction is a new problem, we repurpose and reorganize existing multimodal datasets [6, 39, 18] which contain a variety of modalities, as summarized in Table 1.
Dataset Splits | Yes | Training has 62,429 samples, validation 6,750 and testing 6,641. Validation and testing contain 20,874 and 17,738 samples and use all four modalities. The validation and test sets have 127 and 765 video samples respectively.
Hardware Specification | Yes | We use three NVIDIA RTX A6000 GPUs to train our model with a batch size of 96 for video classification, while a single NVIDIA RTX 2080Ti GPU with a batch size of 128 is used for robotic state regression and multimedia retrieval.
Software Dependencies | No | The paper mentions several models and frameworks (e.g., Swin-T, ResNet, AST), but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions).
Experiment Setup | Yes | Both our feature projection module and transformer layers for prediction consist of six transformer layers [7], each with 8 heads and hidden dimension d of 256. The number of learnable tokens for the feature alignment loss is set to n_u = 3806 on EPIC-Kitchens, and n_u = 128 on MSR-VTT and the robot dataset. The length of the feature tokens after feature projection is set as k = 512 on EPIC-Kitchens, and k = 16 on MSR-VTT and the robot dataset. Our method is trained for 120 epochs on video classification with a learning rate of 10^-4, reduced to 10^-5 for the last 50 epochs. On robot state regression and multimedia retrieval, we train our method for 50 epochs with a learning rate of 10^-2.
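
To make the reported configuration concrete, the sketch below instantiates two six-layer transformer stacks with the stated sizes (8 heads, hidden dimension 256) and the video-classification learning-rate schedule. This is a minimal PyTorch sketch, not the authors' released code: the module wiring, function names, optimizer choice, and schedule milestone are illustrative assumptions; only the layer counts, head counts, hidden dimension, epochs, and learning rates come from the paper.

    import torch
    import torch.nn as nn

    # Sizes reported in the paper.
    HIDDEN_DIM = 256   # hidden dimension d
    NUM_LAYERS = 6     # six transformer layers per stack
    NUM_HEADS = 8      # attention heads per layer

    def make_transformer_stack():
        # A stack of standard transformer encoder layers with the reported sizes.
        layer = nn.TransformerEncoderLayer(
            d_model=HIDDEN_DIM, nhead=NUM_HEADS, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)

    # One stack for feature projection, one for prediction (hypothetical wiring).
    feature_projection = make_transformer_stack()
    prediction_layers = make_transformer_stack()
    model = nn.Sequential(feature_projection, prediction_layers)

    # Video-classification schedule: 120 epochs at 1e-4, dropped to 1e-5 for the
    # last 50 epochs (the optimizer and milestone-based scheduler are assumptions).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70], gamma=0.1)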