Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Authors: Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach on standard image and video datasets ImageNet-21K, Kinetics-400, and Something-Something-v2. and All results presented in this work are empirical. |
| Researcher Affiliation | Collaboration | Google DeepMind; University of Washington; {jaingagan,sujoyp}@google.com |
| Pseudocode | Yes | Algorithm 1 Expert Preferred Routing (EPR); an illustrative sketch of this routing appears after the table. |
| Open Source Code | No | The datasets and codebase on top of which we build our algorithm are open-sourced; we will open-source the code for this paper upon acceptance. |
| Open Datasets | Yes | We validate our approach on standard image and video datasets ImageNet-21K, Kinetics-400, and Something-Something-v2. and mentions ImageNet-21k [18], Kinetics-400 [31], and Something-Something-v2 (SSv2) [24] with corresponding references. |
| Dataset Splits | No | The paper mentions using standard datasets and inheriting hyperparameter values from previous literature (the ViViT [2] paper) but does not explicitly state the specific train/validation/test dataset split percentages or sample counts within the text. |
| Hardware Specification | Yes | Real Time Latency and Throughput gains for MoNE on a single V100 GPU and We use a maximum of 64 TPU v3 chips per training experiment. |
| Software Dependencies | No | We implement MoNE on JAX [9] using Big Vision [7] for image classification and Scenic [16] for video classification. |
| Experiment Setup | No | We empirically evaluate MoNE on image and video classification. For image classification, we train the network with random initialization. As for video classification, we follow previous literature and start from a pre-trained MatViT [19] model due to the inherent nested structure required in MoNE. We follow the joint training strategy of MatViT, with separate losses on all model granularities. [...] We follow the AugReg [43] training strategy to train all our image classification models. For video classification tasks, we inherit all augmentations and hyperparameter values directly from the ViViT [2] paper. |
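
The pseudocode row above cites Algorithm 1, Expert Preferred Routing (EPR), and the paper states the model is implemented in JAX, but no code is reproduced in this report. The snippet below is a minimal, hypothetical JAX sketch of capacity-constrained routing over nested experts of increasing width; the names `expert_preferred_routing`, `nested_mlp`, `NESTED_DIMS`, and the toy capacities are illustrative assumptions, not the authors' released implementation.

```python
import jax
import jax.numpy as jnp

# Hypothetical nested expert widths (smallest to largest); the full model
# width equals the largest expert. These numbers are illustrative only.
NESTED_DIMS = (64, 128, 256)
D_MODEL = NESTED_DIMS[-1]


def expert_preferred_routing(router_probs, capacities):
    """Assign each token to exactly one nested expert.

    Experts pick tokens in decreasing order of size: the largest expert
    claims the tokens it prefers most (highest router probability) up to
    its capacity, the next expert chooses from the remainder, and any
    leftover tokens fall back to the smallest expert.

    router_probs: [num_tokens, num_experts] softmax scores.
    capacities:   tuple of per-expert token budgets (python ints).
    Returns int32 [num_tokens] with the chosen expert index per token.
    """
    num_tokens, num_experts = router_probs.shape
    assignment = jnp.zeros((num_tokens,), dtype=jnp.int32)  # default: expert 0
    taken = jnp.zeros((num_tokens,), dtype=bool)
    for e in range(num_experts - 1, 0, -1):                  # largest expert first
        scores = jnp.where(taken, -jnp.inf, router_probs[:, e])
        top_idx = jnp.argsort(-scores)[: capacities[e]]      # expert's preferred tokens
        free = ~taken[top_idx]                                # skip already-routed tokens
        assignment = assignment.at[top_idx].set(
            jnp.where(free, e, assignment[top_idx]))
        taken = taken.at[top_idx].set(True)
    return assignment


def nested_mlp(tokens, w_in, w_out, assignment):
    """Process each token with an MLP sliced to its assigned nested width."""
    def per_token(x, expert_idx):
        outs = []
        for d in NESTED_DIMS:
            h = jax.nn.gelu(x @ w_in[:, :d])                  # nested slice of weights
            outs.append(h @ w_out[:d, :])
        return jnp.stack(outs)[expert_idx]                    # pick the routed expert's output
    return jax.vmap(per_token)(tokens, assignment)


# Toy usage with random tokens and weights.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
tokens = jax.random.normal(k1, (16, D_MODEL))
w_router = 0.02 * jax.random.normal(k2, (D_MODEL, len(NESTED_DIMS)))
w_in = 0.02 * jax.random.normal(k3, (D_MODEL, D_MODEL))
w_out = 0.02 * jax.random.normal(k4, (D_MODEL, D_MODEL))

probs = jax.nn.softmax(tokens @ w_router, axis=-1)
assignment = expert_preferred_routing(probs, capacities=(8, 4, 4))
out = nested_mlp(tokens, w_in, w_out, assignment)
print(assignment, out.shape)  # per-token expert ids, (16, 256)
```

The capacity-ordered selection mirrors the "expert preferred" idea described in the paper: larger experts get first choice of tokens, so the most informative tokens receive the most compute, while unclaimed tokens default to the cheapest nested slice.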