On Least Squares Estimation in Softmax Gating Mixture of Experts

Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct a simulation study to empirically demonstrate that the convergence rates of least squares estimation under the softmax gating MoE model with ridge experts h_1(x, (a, b)) = sigmoid(ax + b) are significantly faster than those obtained when using linear experts h_2(x, (a, b)) = ax + b. (The first sketch after the table illustrates these two expert choices.)
Researcher Affiliation | Academia | Department of Statistics and Data Sciences, The University of Texas at Austin, USA.
Pseudocode | No | The paper describes algorithms and methods mathematically and verbally, but it does not include any formal pseudocode blocks or clearly labeled algorithm sections.
Open Source Code | No | The paper does not include any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper describes generating synthetic data from specified parameters rather than using a pre-existing, publicly available dataset. It states: 'we generate i.i.d. samples {(X_i, Y_i)}_{i=1}^{n} by first sampling the X_i's from the uniform distribution Uniform[0, 1] and then sampling the Y_i's from the regression equation Y_i = f_{G_*}(X_i) + ε_i'. (The second sketch after the table illustrates this generation step.)
Dataset Splits | No | The paper describes the generation of synthetic data and the training process, but it does not explicitly specify distinct training, validation, or test splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper describes the simulation study but does not provide any details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not list the ancillary software components (e.g., programming languages, libraries, frameworks) or their version numbers that would be needed to replicate the experiments.
Experiment Setup | Yes | We use the stochastic gradient descent algorithm to minimize the mean square losses. We conduct 20 sample generations for each configuration, across a spectrum of 20 different sample sizes n ranging from 10^4 to 10^5. For each j ∈ [k_*], we initialize the parameters β_{1i} by sampling from a Gaussian distribution centered around the true counterpart β*_{1j} with a small variance, where i ∈ A_j. The other parameters β_{0i}, a_i, b_i are initialized similarly. (The third sketch after the table illustrates this fitting loop.)
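
The sketches below are illustrative reconstructions of the simulation described in the rows above, not the authors' code. The first one evaluates the regression function of a softmax gating MoE with scalar input x and k experts, using either ridge (sigmoid) experts h_1(x, (a, b)) = sigmoid(ax + b) or linear experts h_2(x, (a, b)) = ax + b. The parameter names (beta0, beta1, a, b) follow the paper's notation for the gating and expert parameters; everything else is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moe_regression(x, beta0, beta1, a, b, expert="ridge"):
    """f_G(x) for a softmax gating MoE with k experts and scalar input x.

    Gating weights: softmax over (beta1_i * x + beta0_i), i = 1..k.
    Ridge expert:   h_1(x, (a_i, b_i)) = sigmoid(a_i * x + b_i)
    Linear expert:  h_2(x, (a_i, b_i)) = a_i * x + b_i
    """
    logits = beta1 * x + beta0            # gating logits, shape (k,)
    gate = np.exp(logits - logits.max())
    gate /= gate.sum()                    # softmax gate
    if expert == "ridge":
        h = sigmoid(a * x + b)            # ridge (sigmoid) experts
    else:
        h = a * x + b                     # linear experts
    return float(gate @ h)                # gate-weighted expert average
```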
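
The second sketch mirrors the data-generation step quoted in the 'Open Datasets' row: X_i drawn from Uniform[0, 1] and Y_i = f_{G_*}(X_i) + ε_i. It reuses the moe_regression helper from the first sketch; the noise level noise_std is a placeholder, since the excerpt does not state the noise variance.

```python
import numpy as np

def generate_samples(n, beta0, beta1, a, b, noise_std=0.1, expert="ridge", seed=0):
    """Draw n i.i.d. pairs: X_i ~ Uniform[0, 1], Y_i = f_{G_*}(X_i) + eps_i.

    noise_std is a placeholder value, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=n)
    f = np.array([moe_regression(x, beta0, beta1, a, b, expert) for x in X])
    Y = f + rng.normal(0.0, noise_std, size=n)   # additive Gaussian noise eps_i
    return X, Y
```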
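
The third sketch follows the 'Experiment Setup' row: stochastic gradient descent on the mean square loss, with each parameter initialized near its true counterpart by adding small Gaussian noise, and a sweep over 20 sample sizes between 10^4 and 10^5 with 20 replications each. PyTorch is used here only for automatic differentiation; the paper does not name its software, and the learning rate, epoch count, and initialization scale are placeholders.

```python
import numpy as np
import torch

def fit_moe_sgd(X, Y, true_params, expert="ridge", lr=1e-2, epochs=100, init_std=0.01):
    """Minimize the mean square loss with SGD, initializing each parameter
    near its true counterpart (true value + small Gaussian noise)."""
    x = torch.tensor(X, dtype=torch.float64).unsqueeze(1)        # (n, 1)
    y = torch.tensor(Y, dtype=torch.float64)                     # (n,)
    params = []
    for p in true_params:                                        # beta0, beta1, a, b, each shape (k,)
        init = p + np.random.normal(0.0, init_std, size=np.shape(p))
        params.append(torch.tensor(init, dtype=torch.float64, requires_grad=True))
    beta0, beta1, a, b = params
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        gate = torch.softmax(beta1 * x + beta0, dim=1)           # (n, k) softmax gate
        h = torch.sigmoid(a * x + b) if expert == "ridge" else a * x + b
        loss = torch.mean(((gate * h).sum(dim=1) - y) ** 2)      # mean square loss
        loss.backward()
        opt.step()
    return [p.detach().numpy() for p in params]

# Sweep used in the paper's simulation: 20 replications for each of
# 20 sample sizes n spread between 10^4 and 10^5.
sample_sizes = np.linspace(1e4, 1e5, 20, dtype=int)

# Example sweep (not executed here); true_params is a hypothetical tuple
# (beta0_true, beta1_true, a_true, b_true) of arrays of shape (k,):
# for n in sample_sizes:
#     for rep in range(20):
#         X, Y = generate_samples(n, *true_params, seed=rep)
#         est = fit_moe_sgd(X, Y, true_params)
```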