Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Authors: Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate our approach on standard image and video datasets ImageNet-21K, Kinetics-400, and Something-Something-v2." and "All results presented in this work are empirical."
Researcher Affiliation | Collaboration | Google DeepMind; University of Washington; {jaingagan,sujoyp}@google.com
Pseudocode | Yes | Algorithm 1: Expert Preferred Routing (EPR)
Open Source Code | No | "The datasets and codebase on top of which we build our algorithm are open-sourced; we will open-source the code for this paper upon acceptance."
Open Datasets | Yes | "We validate our approach on standard image and video datasets ImageNet-21K, Kinetics-400, and Something-Something-v2." The paper cites ImageNet-21K [18], Kinetics-400 [31], and Something-Something-v2 (SSv2) [24] with corresponding references.
Dataset Splits | No | The paper mentions using standard datasets and inheriting hyperparameter values from previous literature (the ViViT [2] paper) but does not explicitly state train/validation/test split percentages or sample counts.
Hardware Specification | Yes | "Real Time Latency and Throughput gains for MoNE on a single V100 GPU" and "We use a maximum of 64 TPU v3 chips per training experiment."
Software Dependencies | No | "We implement MoNE on JAX [9] using Big Vision [7] for image classification and Scenic [16] for video classification."
Experiment Setup | No | "We empirically evaluate MoNE on image and video classification. For image classification, we train the network with random initialization. As for video classification, we follow previous literature and start from a pre-trained MatViT [19] model due to the inherent nested structure required in MoNE. We follow the joint training strategy of MatViT, with separate losses on all model granularities. [...] We follow the AugReg [43] training strategy to train all our image classification models. For video classification tasks, we inherit all augmentations and hyperparameter values directly from the ViViT [2] paper."