Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms

Authors: Ashok Makkuva, Pramod Viswanath, Sreeram Kannan, Sewoong Oh

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our algorithm on both the synthetic and real data sets in a variety of settings, and show superior performance to standard baselines.
Researcher Affiliation | Academia | (1) Department of Electrical and Computer Engineering, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL, USA; (2) Allen School of Computer Science & Engineering, University of Washington, Seattle, USA; (3) Department of Electrical Engineering, University of Washington, Seattle, USA.
Pseudocode | Yes | Algorithm 1: Learning the regressors... Algorithm 2: Learning the gating parameter
Open Source Code | Yes | Codes are available at this repository: MoE codes.
Open Datasets | Yes | To highlight the generalizability of our algorithm, in Appendix H.2 of the supplement, we compare the performance of our algorithm to that of the standard approaches on a variety of real world datasets. References include: Brooks, T., Pope, D., and Marcolini, A. Airfoil self-noise and prediction. Technical report, NASA, 1989. URL https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise. Liu, Y.-C. and Yeh, I.-C. Using mixture design and neural networks to build stock selection decision support systems. Neural Computing and Applications, 28(3):521-535, 2017. doi: 10.1007/s00521-015-2090-x. URL https://archive.ics.uci.edu/ml/datasets/Stock+portfolio+performance. Yeh, I.-C. Modeling of strength of high performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797-1808, 1998. URL https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength.
Dataset Splits | No | The paper describes generating synthetic data with parameters like n=2000 or n=8000 and d=10, and also mentions using real-world datasets, but it does not specify explicit training, validation, or test splits (e.g., percentages or counts) for these datasets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using the "Orth-ALS package by (Sharan & Valiant, 2017)" but does not provide a specific version number for this or any other software dependency.
Experiment Setup | Yes | For the experiments, we consider the similar setting as before with k = 2, d = 10, σ = 0.1 and the gating parameter w is drawn uniformly from S^9 without the orthogonality restriction. We let x_i ~ i.i.d. N(0, I_d). We choose n = 2000... We let the number of mixture components be k = 3 and k = 4. We let x ~ N(0, I_d) and the gating parameters are drawn uniformly from S^9... n = 8000, d = 10, σ = 0.5.
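The synthetic setting quoted in the Experiment Setup row (k = 2, d = 10, σ = 0.1, n = 2000, Gaussian inputs, gating parameter drawn uniformly from the unit sphere S^9) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code. The row does not state the expert form, the gate function, or how the regressors are chosen, so the sigmoid gate, the linear experts, and the regressors drawn from the sphere are all assumptions, and the helper names `sample_unit_sphere`, `dot`, and `generate_moe_data` are hypothetical.

```python
import math
import random


def sample_unit_sphere(d, rng):
    """Draw a point uniformly from the unit sphere S^(d-1) by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def generate_moe_data(n=2000, d=10, sigma=0.1, seed=0):
    """Sample (x, y) pairs from an assumed 2-expert mixture (k = 2)."""
    rng = random.Random(seed)
    # From the quoted setup: gating parameter w drawn uniformly from S^9 when d = 10.
    w = sample_unit_sphere(d, rng)
    # Assumption: the two regressors are also drawn from the unit sphere.
    experts = [sample_unit_sphere(d, rng) for _ in range(2)]
    xs, ys = [], []
    for _ in range(n):
        # From the quoted setup: x_i ~ N(0, I_d), drawn independently.
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Assumption: for k = 2 the gate picks expert 0 with probability sigmoid(w . x).
        p_first = 1.0 / (1.0 + math.exp(-dot(w, x)))
        a = experts[0] if rng.random() < p_first else experts[1]
        # Assumption: linear expert output plus Gaussian noise with standard deviation sigma.
        y = dot(a, x) + rng.gauss(0.0, sigma)
        xs.append(x)
        ys.append(y)
    return w, experts, xs, ys


w, experts, xs, ys = generate_moe_data()
print(len(xs), len(xs[0]), round(dot(w, w), 6))
```

Normalizing a standard Gaussian vector is a common way to sample uniformly from a sphere. Because the row reports no train, validation, or test split, the sketch returns the full sample and leaves any splitting to the reader.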