Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Attention Bottlenecks for Multimodal Fusion
Authors: Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. |
| Researcher Affiliation | Industry | Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun; EMAIL; Google Research |
| Pseudocode | No | The paper describes the architecture and processes in detail but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | All code and models will be released. |
| Open Datasets | Yes | We experiment with three video classification datasets AudioSet [21], Epic-Kitchens-100 [12] and VGGSound [10], described in more detail below. |
| Dataset Splits | No | For AudioSet, the paper states '20,361 clips for the balanced train set (henceforth referred to as mini-AudioSet or miniAS) and 18,589 clips for the test set.' For VGGSound, '172,427 training and 14,448 test clips.' While train/test splits are provided, explicit details for a validation set split are not present. |
| Hardware Specification | Yes | All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs on TPU accelerators using the Scenic library [13]. |
| Software Dependencies | No | The paper mentions using the 'Scenic library [13]' but does not provide a specific version number for it. |
| Experiment Setup | Yes | Our backbone architecture follows that of ViT [16] identically, specifically we use ViT-Base (ViT-B, L = 12, NH = 12, d = 3072)... Unless otherwise specialised, we use B = 4 bottleneck tokens for all experiments... We set the base learning rate to 0.5 and train for 50 epochs, using Mixup [59] with α = 0.3 and stochastic depth regularisation [27] with probability p = 0.3. All models (across datasets) are trained with a batch size of 64, synchronous SGD with momentum of 0.9, and a cosine learning rate schedule with warmup of 2.5 epochs. |
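
The learning rate schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. It uses the values stated in the table (base learning rate 0.5, 50 epochs, 2.5 warmup epochs, batch size 64), but several details are assumptions that the quoted text does not confirm: the warmup is taken to be linear from zero, the cosine decay is taken to end at zero, and `steps_per_epoch` is derived here from the 20,361-clip balanced train set purely for illustration. The function name `learning_rate` is hypothetical and is not part of Scenic.

```python
import math


def learning_rate(step, steps_per_epoch, base_lr=0.5,
                  total_epochs=50, warmup_epochs=2.5):
    """Cosine schedule with warmup.

    Assumptions (not stated in the quoted setup): the warmup is linear
    from 0 to base_lr, and the cosine decay ends at 0.
    """
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    total_steps = int(total_epochs * steps_per_epoch)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup training that has elapsed, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Half-cosine decay from base_lr down to 0.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Illustrative only: 20,361 training clips at batch size 64,
# dropping the final partial batch.
steps_per_epoch = 20361 // 64

for epoch in (0, 1, 2.5, 25, 50):
    step = int(epoch * steps_per_epoch)
    print(f"epoch {epoch}: lr = {learning_rate(step, steps_per_epoch):.4f}")
```

Under these assumptions the rate reaches its peak of 0.5 exactly when warmup ends and falls to zero at the final step. A different warmup shape or a non-zero final rate would change the curve but not the overall structure.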