Diversity and Depth in Per-Example Routing Models

Authors: Prajit Ramachandran, Quoc V. Le

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we find that adding architectural diversity to routing models significantly improves performance, cutting the error rates of a strong baseline by 35% on an Omniglot setup. However, when scaling up routing depth, we find that modern routing techniques struggle with optimization.
Researcher Affiliation | Industry | Prajit Ramachandran, Google Brain, prajit@google.com; Quoc V. Le, Google Brain, qvl@google.com
Pseudocode | No | The paper provides mathematical formulas and descriptions of processes, but no formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code, nor does it include links to repositories or mention code in supplementary materials.
Open Datasets | Yes | Next, we benchmark routing models with architectural diversity on an Omniglot (Lake et al., 2015) multi-task learning setup.
Dataset Splits | Yes | We follow Liang et al. (2018) by defining a 50%/20%/30% training/validation/test split and using a fixed random subset of 20 alphabets.
Hardware Specification | No | The paper only states 'on a single GPU' without providing specific details such as the GPU model, CPU, or memory, which are necessary for hardware reproducibility.
Software Dependencies | No | The paper mentions an optimizer (Adam), a normalization technique (Group Norm), and an activation function (ReLU), but does not provide specific software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch, or library versions).
Experiment Setup | Yes | k is annealed from 7 to 2 over the layers. We found the k-annealing technique crucial to prevent overfitting. The Adam optimizer (Kingma & Ba, 2014) is used, and the expert-balancing loss for noisy top-k routing is annealed from 0.1 to 0 over the course of training.
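The Dataset Splits row above quotes a 50%/20%/30% training/validation/test split over a fixed random subset of 20 Omniglot alphabets, following Liang et al. (2018). The sketch below is only an illustrative reconstruction of such a split: the seed values, function names, and the choice to shuffle and split at the example level within each alphabet are assumptions, not details given in the paper.

```python
import random


def choose_alphabets(all_alphabets, num_tasks=20, seed=0):
    # Fixed random subset of 20 alphabets, one task per alphabet.
    # The seed value is illustrative, not taken from the paper.
    rng = random.Random(seed)
    return rng.sample(sorted(all_alphabets), num_tasks)


def split_task(examples, seed=0):
    # 50%/20%/30% train/validation/test split within a single alphabet.
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    n = len(examples)
    n_train, n_val = int(0.5 * n), int(0.2 * n)
    return (examples[:n_train],
            examples[n_train:n_train + n_val],
            examples[n_train + n_val:])
```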
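The Experiment Setup row describes noisy top-k routing with k annealed from 7 to 2 over the layers and an expert-balancing loss coefficient annealed from 0.1 to 0 during training. The following is a minimal sketch of those pieces, assuming the Shazeer et al. (2017)-style noisy top-k gate and balancing loss that this line of work builds on; the linear annealing schedules and all function names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def noisy_top_k_gate(x, w_gate, w_noise, k):
    # Noisy top-k gating: per-example logits plus input-dependent noise,
    # keep only the k largest, softmax over the survivors.
    clean_logits = x @ w_gate
    noise_std = F.softplus(x @ w_noise)
    logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    top_vals, top_idx = logits.topk(k, dim=-1)
    gates = torch.zeros_like(logits)
    gates.scatter_(-1, top_idx, F.softmax(top_vals, dim=-1))
    return gates  # [batch, num_experts], nonzero for only k experts per example


def k_for_layer(layer_idx, num_layers, k_first=7, k_last=2):
    # "k is annealed from 7 to 2 over the layers"; linear interpolation
    # is an assumption, as the paper does not spell out the schedule shape.
    frac = layer_idx / max(num_layers - 1, 1)
    return int(round(k_first + (k_last - k_first) * frac))


def balancing_coefficient(step, total_steps, start=0.1, end=0.0):
    # Expert-balancing loss coefficient annealed from 0.1 to 0 over training.
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac


def expert_balancing_loss(gates, coeff):
    # Squared coefficient of variation of per-expert importance
    # (Shazeer et al., 2017), scaled by the annealed coefficient.
    importance = gates.sum(dim=0)
    cv_sq = importance.var() / (importance.mean() ** 2 + 1e-10)
    return coeff * cv_sq
```

In use, each routed layer would call `noisy_top_k_gate` with its own `k_for_layer(...)` value, and the training loop would add `expert_balancing_loss(gates, balancing_coefficient(step, total_steps))` to the task loss.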