Learning Unseen Modality Interaction
Authors: Yunhua Zhang, Hazel Doughty, Cees Snoek
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that our approach is effective for diverse tasks and modalities by evaluating it for multimodal video classification, robot state regression, and multimedia retrieval. To evaluate this new problem, we reorganize existing datasets and tasks for video classification, robotic state regression, and multimedia retrieval on a variety of modality combinations. |
| Researcher Affiliation | Academia | Yunhua Zhang, Hazel Doughty, Cees G.M. Snoek University of Amsterdam |
| Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Project website: https://xiaobai1217.github.io/Unseen-Modality-Interaction/. |
| Open Datasets | Yes | Since unseen modality interaction is a new problem, we repurpose and reorganize existing multimodal datasets [6, 39, 18] which contain a variety of modalities, as summarized in Table 1. |
| Dataset Splits | Yes | Training has 62,429 samples, validation 6,750 and testing 6,641. Validation and testing contain 20,874 and 17,738 samples and use all four modalities. The validation and test sets have 127 and 765 video samples respectively. |
| Hardware Specification | Yes | We use three NVIDIA RTX A6000 GPUs to train our model with a batch size of 96 for video classification, while a single NVIDIA RTX 2080Ti GPU with a batch size of 128 for robotic state regression and multimedia retrieval. |
| Software Dependencies | No | The paper mentions several models and frameworks (e.g., Swin-T, ResNet, AST), but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions). |
| Experiment Setup | Yes | Both our feature projection module and the transformer layers for prediction consist of six transformer layers [7], each with 8 heads and hidden dimension d of 256. The number of learnable tokens for the feature alignment loss is set to n_u = 3806 on EPIC-Kitchens, and n_u = 128 on MSR-VTT and the robot dataset. The length of the feature tokens after feature projection is set as k = 512 on EPIC-Kitchens, and k = 16 on MSR-VTT and the robot dataset. Our method is trained for 120 epochs on video classification with a learning rate of 10^-4, reduced to 10^-5 for the last 50 epochs. On robot state regression and multimedia retrieval, we train our method for 50 epochs with a learning rate of 10^-2. (A hedged code sketch of this configuration follows the table.) |
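
To make the reported experiment setup concrete, below is a minimal PyTorch sketch of the described configuration: six transformer layers with 8 heads and hidden dimension 256 for both the feature projection and prediction modules, k = 512 feature tokens and n_u = 3806 alignment tokens (EPIC-Kitchens setting), and the 120-epoch schedule with the learning rate dropping from 10^-4 to 10^-5 for the last 50 epochs. The module names, input feature dimension, pooling step, optimizer, and class count are assumptions for illustration only; this is not the authors' released implementation (see the project website for that).

```python
# Illustrative sketch of the hyperparameters quoted in the "Experiment Setup" row.
# Wiring, pooling, optimizer, feat_dim and num_classes are assumptions, not the paper's code.
import torch
import torch.nn as nn

D_MODEL = 256    # hidden dimension d
N_HEADS = 8      # attention heads per layer
N_LAYERS = 6     # transformer layers in each module
K_TOKENS = 512   # feature-token length k (EPIC-Kitchens setting)
N_U = 3806       # learnable tokens for the feature alignment loss (EPIC-Kitchens)


def make_encoder() -> nn.TransformerEncoder:
    """Six-layer transformer encoder with 8 heads and hidden size 256."""
    layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=N_LAYERS)


class UnseenModalitySketch(nn.Module):
    """Assumed wiring: project per-modality features to k tokens with one
    transformer, then classify from a second transformer over those tokens."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, D_MODEL)
        self.feature_projection = make_encoder()
        self.prediction = make_encoder()
        # Learnable tokens consumed by the feature alignment loss (size n_u).
        self.align_tokens = nn.Parameter(torch.randn(N_U, D_MODEL))
        self.pool = nn.AdaptiveAvgPool1d(K_TOKENS)  # assumed way to obtain k tokens
        self.head = nn.Linear(D_MODEL, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.feature_projection(self.input_proj(feats))  # (B, T, 256)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)     # (B, k, 256)
        x = self.prediction(x)
        return self.head(x.mean(dim=1))


# Reported schedule for video classification: 120 epochs at lr 1e-4, lowered to 1e-5
# for the final 50 epochs (epoch 70 onward). Optimizer choice is an assumption.
model = UnseenModalitySketch(feat_dim=1024, num_classes=300)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70], gamma=0.1)
```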