Attention Bottlenecks for Multimodal Fusion
Authors: Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. |
| Researcher Affiliation | Industry | Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun — {anagrani, shanyang, aarnab, arenjansen, cordelias, chensun}@google.com, Google Research |
| Pseudocode | No | The paper describes the architecture and processes in detail but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | All code and models will be released. |
| Open Datasets | Yes | We experiment with three video classification datasets Audio Set [21], Epic-Kitchens-100 [12] and VGGSound [10], described in more detail below. |
| Dataset Splits | No | For Audio Set, the paper states '20,361 clips for the balanced train set (henceforth referred to as mini-Audio Set or mini AS) and 18,589 clips for the test set.' For VGGSound, '172,427 training and 14,448 test clips.' While train/test splits are provided, explicit details for a validation set split are not present. |
| Hardware Specification | Yes | All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs on TPU accelerators using the Scenic library [13]. |
| Software Dependencies | No | The paper mentions using the 'Scenic library [13]' but does not provide a specific version number for it. |
| Experiment Setup | Yes | Our backbone architecture follows that of ViT [16] identically; specifically we use ViT-Base (ViT-B, L = 12, NH = 12, d = 3072)... Unless otherwise specified, we use B = 4 bottleneck tokens for all experiments... We set the base learning rate to 0.5 and train for 50 epochs, using Mixup [59] with α = 0.3 and stochastic depth regularisation [27] with probability p = 0.3. All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs. |
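The optimisation recipe quoted above (base learning rate 0.5, 50 epochs, cosine schedule with 2.5-epoch warmup) can be sketched as a standalone schedule function. This is a generic linear-warmup-plus-cosine-decay implementation, not code from the paper; the steps-per-epoch value below is an illustrative assumption.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr=0.5):
    """Cosine learning-rate schedule with linear warmup.

    Linearly ramps from 0 to base_lr over warmup_steps, then decays
    to 0 along a half cosine over the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Illustrative numbers: 50 epochs at a hypothetical 1,000 steps/epoch,
# with a 2.5-epoch warmup as stated in the paper.
total_steps, warmup_steps = 50_000, 2_500
print(lr_at_step(0, total_steps, warmup_steps))            # 0.0 (start of warmup)
print(lr_at_step(warmup_steps, total_steps, warmup_steps))  # 0.5 (peak LR)
print(lr_at_step(total_steps, total_steps, warmup_steps))   # ~0.0 (fully decayed)
```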
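The B = 4 bottleneck tokens mentioned in the setup are the paper's core fusion mechanism: each modality attends only to its own tokens plus a small shared bottleneck, which forces cross-modal information through a narrow channel. The sketch below is a toy NumPy illustration of that token-routing pattern, assuming single-head attention with identity projections; it omits the real ViT layer's learned projections, MLP, layer norm, and residual connections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Toy single-head self-attention with identity Q/K/V projections.
    scores = softmax(tokens @ tokens.T / np.sqrt(d))
    return scores @ tokens

def bottleneck_fusion_layer(audio, video, bottleneck):
    """One fusion step in the bottleneck style: each modality attends
    over [own tokens; bottleneck], and the two updated copies of the
    bottleneck are averaged so it carries information across modalities.
    """
    d = audio.shape[-1]
    a = self_attention(np.concatenate([audio, bottleneck]), d)
    v = self_attention(np.concatenate([video, bottleneck]), d)
    n_b = bottleneck.shape[0]
    new_bottleneck = (a[-n_b:] + v[-n_b:]) / 2.0
    return a[:-n_b], v[:-n_b], new_bottleneck

rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))       # 8 audio tokens, dim 16 (toy sizes)
video = rng.normal(size=(8, 16))       # 8 video tokens
bottleneck = rng.normal(size=(4, 16))  # B = 4 shared bottleneck tokens
a, v, z = bottleneck_fusion_layer(audio, video, bottleneck)
print(a.shape, v.shape, z.shape)  # (8, 16) (8, 16) (4, 16)
```

Note that the audio and video token streams never attend to each other directly; only the 4 bottleneck tokens are shared, which is what keeps the fusion cost low compared with full pairwise cross-attention.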