Learning Unseen Modality Interaction
Authors: Yunhua Zhang, Hazel Doughty, Cees Snoek
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. To evaluate this new problem, we reorganize existing datasets and tasks for video classification, robotic state regression, and multimedia retrieval on a variety of modality combinations. |
| Researcher Affiliation | Academia | Yunhua Zhang, Hazel Doughty, Cees G.M. Snoek University of Amsterdam |
| Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/. |
| Open Datasets | Yes | Since unseen modality interaction is a new problem, we repurpose and reorganize existing multimodal datasets [6, 39, 18] which contain a variety of modalities, as summarized in Table 1. |
| Dataset Splits | Yes | Training has 62,429 samples, validation 6,750 and testing 6,641. Validation and testing contain 20,874 and 17,738 samples and use all four modalities. The validation and test sets have 127 and 765 video samples respectively. |
| Hardware Specification | Yes | We use three NVIDIA RTX A6000 GPUs to train our model with a batch size of 96 for video classification, while a single NVIDIA RTX 2080Ti GPU with a batch size of 128 for robotic state regression and multimedia retrieval. |
| Software Dependencies | No | The paper mentions several models and frameworks (e.g., Swin-T, ResNet, AST), but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions). |
| Experiment Setup | Yes | Both our feature projection module and the transformer layers for prediction consist of six transformer layers [7], each with 8 heads and hidden dimension d of 256. The number of learnable tokens for the feature alignment loss is set to n_u = 3806 on EPIC-Kitchens, and n_u = 128 on MSR-VTT and the robot dataset. The length of the feature tokens after feature projection is set as k = 512 on EPIC-Kitchens, and k = 16 on MSR-VTT and the robot dataset. Our method is trained for 120 epochs on video classification with a learning rate of 10^-4, reduced to 10^-5 for the last 50 epochs. On robot state regression and multimedia retrieval, we train our method for 50 epochs with a learning rate of 10^-2. (A hedged code sketch of this configuration follows the table.) |
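
To make the reported experiment setup concrete, below is a minimal PyTorch sketch of the described configuration: six transformer layers with 8 heads and hidden dimension 256 for both the feature projection and prediction modules, k = 512 feature tokens and n_u = 3806 alignment tokens (EPIC-Kitchens setting), and the 120-epoch schedule with the learning rate dropping from 10^-4 to 10^-5 for the last 50 epochs. The module names, input feature dimension, pooling step, optimizer, and class count are assumptions for illustration only; this is not the authors' released implementation (see the project website for that).

```python
# Illustrative sketch of the hyperparameters quoted in the "Experiment Setup" row.
# Wiring, pooling, optimizer, feat_dim and num_classes are assumptions, not the paper's code.
import torch
import torch.nn as nn

D_MODEL = 256    # hidden dimension d
N_HEADS = 8      # attention heads per layer
N_LAYERS = 6     # transformer layers in each module
K_TOKENS = 512   # feature-token length k (EPIC-Kitchens setting)
N_U = 3806       # learnable tokens for the feature alignment loss (EPIC-Kitchens)


def make_encoder() -> nn.TransformerEncoder:
    """Six-layer transformer encoder with 8 heads and hidden size 256."""
    layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=N_LAYERS)


class UnseenModalitySketch(nn.Module):
    """Assumed wiring: project per-modality features to k tokens with one
    transformer, then classify from a second transformer over those tokens."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, D_MODEL)
        self.feature_projection = make_encoder()
        self.prediction = make_encoder()
        # Learnable tokens consumed by the feature alignment loss (size n_u).
        self.align_tokens = nn.Parameter(torch.randn(N_U, D_MODEL))
        self.pool = nn.AdaptiveAvgPool1d(K_TOKENS)  # assumed way to obtain k tokens
        self.head = nn.Linear(D_MODEL, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.feature_projection(self.input_proj(feats))  # (B, T, 256)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)     # (B, k, 256)
        x = self.prediction(x)
        return self.head(x.mean(dim=1))


# Reported schedule for video classification: 120 epochs at lr 1e-4, lowered to 1e-5
# for the final 50 epochs (epoch 70 onward). Optimizer choice is an assumption.
model = UnseenModalitySketch(feat_dim=1024, num_classes=300)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70], gamma=0.1)
```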