Statistical Perspective of Top-K Sparse Softmax Gating Mixture of Experts

Authors: Huy Nguyen, Pedram Akbarian, Fanqi Yan, Nhat Ho

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By focusing on a Gaussian mixture of experts, we establish theoretical results on the effects of the top-K sparse softmax gating function on both density and parameter estimations. Our results hinge upon defining novel loss functions among parameters to capture different behaviors of the input regions. In this supplementary material, we first provide rigorous proofs for all results under the exact-specified settings in Appendix A, while those for the over-specified settings are then presented in Appendix B. Next, we study the identifiability of the top-K sparse softmax gating Gaussian mixture of experts (MoE) in Appendix C. We then carry out several numerical experiments in Appendix D to empirically justify our theoretical results. (A minimal sketch of the top-K sparse softmax gating function appears after the table.)
Researcher Affiliation | Academia | Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, huynm@utexas.edu; Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, akbarian@utexas.edu; Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, fanqi.yan@utexas.edu; Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, TX 78712, minhnhat@utexas.edu
Pseudocode | No | The paper discusses the use of the EM algorithm and a coordinate gradient descent algorithm, but it does not include any pseudocode blocks, algorithm figures, or sections explicitly labeled "Algorithm".
Open Source Code | No | The paper does not contain any explicit statements about making its source code available, nor does it provide a link to a code repository for the methodology described.
Open Datasets | No | The paper states in Section D.1, "we generate i.i.d. samples {(X_i, Y_i)}_{i=1}^{n} by first sampling X_i's from the uniform distribution Uniform[0, 1] and then sampling Y_i's from the true conditional density g_{G_*}(Y|X)..." This indicates that the data used for the experiments are synthetically generated, not drawn from a publicly available or open dataset. (A data-generation sketch following this description appears after the table.)
Dataset Splits | No | The paper describes generating synthetic data for different sample sizes and running multiple sample generations and replications (e.g., "40 sample generations for each configuration, across a spectrum of 200 different sample sizes"). This methodology does not involve predefined training, validation, or test dataset splits in the traditional sense, as data is simulated for each run.
Hardware Specification | No | The paper mentions conducting "numerical experiments" in Appendix D but does not provide any specific details about the hardware used for these experiments. There is no mention of GPU models, CPU types, or other hardware specifications.
Software Dependencies | No | The paper describes the use of an "EM-based numerical scheme" and a "simple coordinate gradient descent algorithm." However, it does not specify any software dependencies with version numbers for libraries, frameworks, or programming languages (e.g., Python version, PyTorch version).
Experiment Setup | Yes | Additionally, we select the convergence criterion of ϵ = 10^{-6} and run a maximum of 2000 EM iterations. Moreover, we repeat this process for each replication. Subsequently, for each j ∈ [k_*], we initialize the parameters β_{1i} by sampling from a Gaussian distribution centered around its true counterpart β*_{1j} with a small variance, where i ∈ C_j. The other parameters β_{0i}, a_i, b_i, σ_i are also initialized in a similar fashion. (An initialization and stopping-rule sketch based on this description appears after the table.)
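
For concreteness, the following is a minimal NumPy sketch of the top-K sparse softmax gating function referenced in the Research Type row: the softmax is taken over the K largest gating logits and the remaining weights are set to zero. The paper itself contains no code, so the function and variable names are illustrative, not the authors'.

import numpy as np

def top_k_sparse_softmax(logits, k):
    """Softmax restricted to the k largest logits; all other weights are zero."""
    logits = np.asarray(logits, dtype=float)
    top_idx = np.argpartition(logits, -k)[-k:]          # indices of the k largest logits
    shifted = logits[top_idx] - logits[top_idx].max()   # shift for numerical stability
    weights = np.zeros_like(logits)
    weights[top_idx] = np.exp(shifted) / np.exp(shifted).sum()
    return weights

# Example: gating weights over 4 experts, keeping only the top 2.
print(top_k_sparse_softmax([0.3, -1.2, 2.0, 0.9], k=2))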
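
The Open Datasets row describes purely synthetic data. Below is a sketch of that generation procedure for a one-dimensional top-K sparse softmax gating Gaussian MoE, reusing top_k_sparse_softmax (and the numpy import) from the previous sketch. The "true" parameter values, the number of experts, and K are placeholders, not values taken from the paper.

rng = np.random.default_rng(0)

# Placeholder true parameters: (beta_1, beta_0) for the gating network,
# (a, b, sigma) for the Gaussian experts. Two experts, keep the top K = 1.
beta_1 = np.array([2.0, -2.0]); beta_0 = np.array([0.0, 0.5])
a = np.array([1.0, -1.0]); b = np.array([0.0, 1.0]); sigma = np.array([0.3, 0.4])
K = 1

def sample_pair():
    x = rng.uniform(0.0, 1.0)                            # X_i ~ Uniform[0, 1]
    gate = top_k_sparse_softmax(beta_1 * x + beta_0, K)  # sparse gating weights at x
    j = rng.choice(len(gate), p=gate)                    # select an expert
    y = rng.normal(a[j] * x + b[j], sigma[j])            # Y_i | X_i from that Gaussian expert
    return x, y

n = 1000
samples = np.array([sample_pair() for _ in range(n)])   # shape (n, 2): columns X_i, Y_i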
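
Finally, the Experiment Setup row specifies a stopping rule (ϵ = 10^{-6}, at most 2000 EM iterations) and an initialization in which each fitted parameter is drawn from a Gaussian centered at its true counterpart. A hedged sketch of those two ingredients follows, again reusing np and rng from the sketches above. The EM updates themselves are not spelled out in the paper, so em_step is a caller-supplied function, and the perturbation scale is an illustrative choice rather than a value from the paper.

TOL = 1e-6        # convergence criterion epsilon from the Experiment Setup row
MAX_ITERS = 2000  # maximum number of EM iterations

def init_near_truth(true_values, assignment, scale=0.05):
    """Draw each fitted parameter from a Gaussian centered at its true counterpart.

    assignment[i] = j means fitted component i is associated with true component j
    (i.e., i lies in the Voronoi cell C_j); scale is an illustrative small std.
    """
    return np.array([rng.normal(true_values[j], scale) for j in assignment])

def run_em(em_step, params, data):
    """Iterate a caller-supplied EM update until the log-likelihood gain falls below TOL."""
    prev_ll = -np.inf
    for _ in range(MAX_ITERS):
        params, ll = em_step(params, data)  # one E-step + M-step (not specified in the paper)
        if ll - prev_ll < TOL:
            break
        prev_ll = ll
    return params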