Attention Bottlenecks for Multimodal Fusion

Authors: Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound."
Researcher Affiliation | Industry | "Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun, {anagrani, shanyang, aarnab, arenjansen, cordelias, chensun}@google.com, Google Research"
Pseudocode | No | The paper describes the architecture and fusion process in detail but does not include structured pseudocode or algorithm blocks (a hedged code sketch of the described bottleneck fusion is given after the table).
Open Source Code | No | The paper states only that "All code and models will be released."
Open Datasets | Yes | "We experiment with three video classification datasets, Audio Set [21], Epic-Kitchens-100 [12] and VGGSound [10], described in more detail below."
Dataset Splits | No | For Audio Set, the paper states '20,361 clips for the balanced train set (henceforth referred to as mini-Audio Set or mini AS) and 18,589 clips for the test set.' For VGGSound, '172,427 training and 14,448 test clips.' Train/test splits are given, but an explicit validation split is not described.
Hardware Specification | Yes | "All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs on TPU accelerators using the Scenic library [13]."
Software Dependencies | No | The paper mentions using the 'Scenic library [13]' but does not provide a specific version number for it.
Experiment Setup | Yes | "Our backbone architecture follows that of ViT [16] identically, specifically we use ViT-Base (ViT-B, L = 12, NH = 12, d = 3072)... Unless otherwise specified, we use B = 4 bottleneck tokens for all experiments... We set the base learning rate to 0.5 and train for 50 epochs, using Mixup [59] with α = 0.3 and stochastic depth regularisation [27] with probability p = 0.3. All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs."
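
Because the paper describes its fusion mechanism only in prose, here is a minimal sketch of one bottleneck-fusion layer as that description reads: each modality's tokens pass through their own transformer layer together with a small set of shared bottleneck tokens, and the two updated copies of the bottleneck tokens are averaged, so cross-modal information can flow only through that narrow bottleneck. The class name, the use of PyTorch's nn.TransformerEncoderLayer, and the 768-dimensional ViT-B hidden size are illustrative assumptions; the authors' own implementation is built on the JAX-based Scenic library, not PyTorch.

```python
# Hypothetical sketch of bottleneck fusion as described in the paper; not the authors' code.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One fusion layer: each modality attends over its own tokens plus a small
    set of shared bottleneck tokens; the two updated copies of the bottleneck
    tokens are then averaged."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        # Separate (unshared) transformer layers per modality.
        self.rgb_layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
        self.spec_layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)

    def forward(self, z_rgb, z_spec, z_fsn):
        # z_rgb: (batch, N_rgb, d), z_spec: (batch, N_spec, d), z_fsn: (batch, B, d)
        num_bottleneck = z_fsn.shape[1]
        rgb_out = self.rgb_layer(torch.cat([z_rgb, z_fsn], dim=1))
        spec_out = self.spec_layer(torch.cat([z_spec, z_fsn], dim=1))
        z_rgb, fsn_rgb = rgb_out[:, :-num_bottleneck], rgb_out[:, -num_bottleneck:]
        z_spec, fsn_spec = spec_out[:, :-num_bottleneck], spec_out[:, -num_bottleneck:]
        # Cross-modal information is exchanged only via the shared bottleneck tokens.
        z_fsn = 0.5 * (fsn_rgb + fsn_spec)
        return z_rgb, z_spec, z_fsn

# Example with B = 4 bottleneck tokens and ViT-B-sized layers.
layer = BottleneckFusionLayer()
z_rgb, z_spec = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
z_fsn = torch.randn(2, 4, 768)
z_rgb, z_spec, z_fsn = layer(z_rgb, z_spec, z_fsn)
```

With the default B = 4 tokens from the quoted setup, the two modalities exchange information through only four latent vectors per fusion layer.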
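
For the quoted optimisation settings, the sketch below works through the learning-rate schedule alone: base rate 0.5, 50 epochs, cosine decay, and a 2.5-epoch linear warmup. The function name is ours, and the exact warmup and decay formulas (and whether the rate is updated per step or per epoch) are assumptions; the paper specifies only the quoted hyperparameters and trains with synchronous SGD (momentum 0.9, batch size 64) on TPUs via the Scenic library.

```python
# Assumed functional form of the reported cosine schedule with linear warmup.
import math

def mbt_learning_rate(epoch, base_lr=0.5, total_epochs=50, warmup_epochs=2.5):
    """Cosine learning-rate schedule with linear warmup (epoch may be fractional)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs            # linear warmup to the base rate
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay to zero

# The rate rises linearly to 0.5 by epoch 2.5, then decays to 0 at epoch 50:
print(mbt_learning_rate(0), mbt_learning_rate(2.5), mbt_learning_rate(50))  # 0.0 0.5 0.0
```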