Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Authors: Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate our approach on standard image and video datasets ImageNet-21K, Kinetics-400, and Something-Something-v2." and "All results presented in this work are empirical."
Researcher Affiliation | Collaboration | Google DeepMind; University of Washington; {jaingagan,sujoyp}@google.com
Pseudocode | Yes | Algorithm 1: Expert Preferred Routing (EPR)
Open Source Code | No | "The datasets and codebase on top of which we build our algorithm are open-sourced; we will open-source the code for this paper upon acceptance."
Open Datasets | Yes | "We validate our approach on standard image and video datasets ImageNet-21K, Kinetics-400, and Something-Something-v2." The paper cites ImageNet-21K [18], Kinetics-400 [31], and Something-Something-v2 (SSv2) [24] with corresponding references.
Dataset Splits | No | The paper mentions using standard datasets and inheriting hyperparameter values from previous literature (the ViViT [2] paper) but does not explicitly state train/validation/test split percentages or sample counts.
Hardware Specification | Yes | "Real Time Latency and Throughput gains for MoNE on a single V100 GPU" and "We use a maximum of 64 TPU v3 chips per training experiment."
Software Dependencies | No | "We implement MoNE on JAX [9] using Big Vision [7] for image classification and Scenic [16] for video classification."
Experiment Setup | No | "We empirically evaluate MoNE on image and video classification. For image classification, we train the network with random initialization. As for video classification, we follow previous literature and start from a pre-trained MatViT [19] model due to the inherent nested structure required in MoNE. We follow the joint training strategy of MatViT, with separate losses on all model granularities. [...] We follow the AugReg [43] training strategy to train all our image classification models. For video classification tasks, we inherit all augmentations and hyperparameter values directly from the ViViT [2] paper."