Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

Authors: Elad Amrani, Rami Ben-Ari, Daniel Rotman, Alex Bronstein (pp. 6644-6652)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate how our noise estimation can be broadly integrated and achieves comparable results to state-of-the-art performance on five different benchmark datasets for two challenging multimodal tasks: Video Question Answering and Text-To-Video Retrieval. Furthermore, we provide a theoretical probabilistic error bound substantiating our empirical results and analyze failure cases."
Researcher Affiliation | Collaboration | Elad Amrani (IBM Research AI, Technion), Rami Ben-Ari (IBM Research AI), Daniel Rotman (IBM Research AI), Alex Bronstein (Technion)
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Code: https://github.com/elad-amrani/ssml
Open Datasets | Yes | "Ultimately, we integrate our proposed building block into an embedding model and learn superior joint video-text representations that achieve comparable state-of-the-art performance on five datasets: MSRVTT (Xu et al. 2016), LSMDC (Rohrbach et al. 2015), MSVD (Chen and Dolan 2011), MSRVTT-QA (Xu et al. 2017) and MSVD-QA (Xu et al. 2017); for two different tasks: Video Question Answering and Text to Video Retrieval." The model is trained on the HowTo100M (Miech et al. 2019) narrated video dataset.
Dataset Splits | No | The paper trains and evaluates on specific datasets (e.g., MSRVTT, LSMDC, MSVD, HowTo100M) and states "See extended version (Amrani et al. 2020) for detailed statistics of each dataset," so specific dataset splits are not detailed in the main paper.
Hardware Specification | Yes | "Training the model on the large HowTo100M dataset is done on a single V100 GPU and takes less than 24 hours."
Software Dependencies | No | The paper mentions several software components and models (e.g., word2vec, the ADAM optimizer, FAISS, ResNet-152, ResNeXt-101) with citations, but does not provide version numbers for these dependencies.
Experiment Setup | Yes | "We use d_v = 4096, d_c = 300, and d = 6144. We use the ADAM (Kingma and Ba 2015) optimizer with a fixed learning rate of 10^-3."
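The reported setup (embedding dimensions d_v, d_c, d and a fixed ADAM learning rate) can be captured in a small configuration sketch. The variable names and dictionary layout below are illustrative assumptions, not taken from the paper or the released code; only the numeric values are quoted from the experiment setup:

```python
# Hyperparameters quoted from the paper's experiment setup.
# The config structure and names are illustrative assumptions.
D_V = 4096            # d_v: video feature dimension
D_C = 300             # d_c: caption (word2vec) embedding dimension
D = 6144              # d: joint embedding dimension
LEARNING_RATE = 1e-3  # fixed ADAM learning rate (10^-3)

config = {
    "d_v": D_V,
    "d_c": D_C,
    "d": D,
    "optimizer": "adam",
    "lr": LEARNING_RATE,
}

print(config)
```

A sketch like this would typically be passed to the model constructor and optimizer factory; the paper itself only reports the values, not how they are wired into the training script.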