reproducibilityindex.ai

Attention Bottlenecks for Multimodal Fusion

Authors: Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

NeurIPS 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound.
Researcher Affiliation	Industry	Arsha Nagrani Shan Yang Anurag Arnab Aren Jansen Cordelia Schmid Chen Sun {anagrani, shanyang, aarnab, arenjansen, cordelias, chensun}@google.com Google Research
Pseudocode	No	The paper describes the architecture and processes in detail but does not include structured pseudocode or algorithm blocks.
Open Source Code	No	All code and models will be released.
Open Datasets	Yes	We experiment with three video classification datasets: AudioSet [21], Epic-Kitchens-100 [12] and VGGSound [10], described in more detail below.
Dataset Splits	No	For AudioSet, the paper states '20,361 clips for the balanced train set (henceforth referred to as mini-AudioSet or miniAS) and 18,589 clips for the test set.' For VGGSound, it reports '172,427 training and 14,448 test clips.' Train/test splits are provided, but no explicit details of a validation split are given.
Hardware Specification	Yes	All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs on TPU accelerators using the Scenic library [13].
Software Dependencies	No	The paper mentions using the 'Scenic library [13]' but does not provide a specific version number for it.
Experiment Setup	Yes	Our backbone architecture follows that of ViT [16] identically, specifically we use ViT-Base (ViT-B, L = 12, NH = 12, d = 3072)... Unless otherwise specialised, we use B = 4 bottleneck tokens for all experiments... We set the base learning rate to 0.5 and train for 50 epochs, using Mixup [59] with α = 0.3 and stochastic depth regularisation [27] with probability p = 0.3. All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs.
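
The learning rate schedule quoted in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' Scenic configuration: the function name, the linear shape of the warmup, the decay to zero, and the rounding of steps per epoch are all assumptions, since the quoted text only gives the base learning rate (0.5), the epoch count (50), the warmup length (2.5 epochs), and the batch size (64).

```python
import math


def cosine_lr_with_warmup(step, steps_per_epoch, base_lr=0.5,
                          total_epochs=50, warmup_epochs=2.5):
    """Hypothetical helper: learning rate at a given optimizer step.

    Assumes linear warmup from 0 to base_lr, then cosine decay to 0.
    The quoted setup does not state the warmup shape or the final value.
    """
    total_steps = int(total_epochs * steps_per_epoch)
    warmup_steps = int(warmup_epochs * steps_per_epoch)

    # Linear warmup phase (assumed shape).
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps

    # Cosine decay over the remaining steps, clamped at the end of training.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Steps per epoch for the 20,361-clip balanced train set at batch size 64,
# assuming the last partial batch is kept (an assumption, not from the paper).
steps_per_epoch = math.ceil(20361 / 64)
warmup_steps = int(2.5 * steps_per_epoch)
total_steps = 50 * steps_per_epoch
```

Under these assumptions, the rate rises from 0 to the base value of 0.5 over the first 2.5 epochs and then decays along a half cosine until the end of epoch 50. A real run would more likely use the schedule utilities of the training library in use rather than a hand-written function.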