On Least Square Estimation in Softmax Gating Mixture of Experts
Authors: Huy Nguyen, Nhat Ho, Alessandro Rinaldo
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct a simulation study to empirically demonstrate that the convergence rates of least square estimation under the softmax gating MoE model with ridge experts h1(x, (a, b)) = sigmoid(ax + b) are significantly faster than those obtained when using linear experts h2(x, (a, b)) = ax + b. (A minimal sketch of these expert functions appears after the table.) |
| Researcher Affiliation | Academia | 1Department of Statistics and Data Sciences, The University of Texas at Austin, USA. |
| Pseudocode | No | The paper describes algorithms and methods mathematically and verbally, but it does not include any formal pseudocode blocks or clearly labeled algorithm sections. |
| Open Source Code | No | The paper does not include any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes generating synthetic data based on specified parameters rather than using a pre-existing publicly available dataset. It states: 'we generate i.i.d. samples {(X_i, Y_i)}_{i=1}^{n} by first sampling the X_i's from the uniform distribution Uniform[0, 1] and then sampling the Y_i's from the regression equation Y_i = f_{G_*}(X_i) + ε_i'. |
| Dataset Splits | No | The paper describes the generation of synthetic data and the training process, but it does not explicitly specify distinct training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper describes the simulation study but does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific details about ancillary software components (e.g., programming languages, libraries, frameworks) including their version numbers, which would be needed to replicate the experiments. |
| Experiment Setup | Yes | We use the stochastic gradient descent algorithm to minimize the mean square losses. We conduct 20 sample generations for each configuration, across a spectrum of 20 different sample sizes n ranging from 10^4 to 10^5. For each j ∈ [k_*], we initialize the parameters β_{1i} by sampling from a Gaussian distribution centered around the true counterpart β*_{1j} with a small variance, where i ∈ A_j. The other parameters β_{0i}, a_i, and b_i are initialized similarly. (A minimal sketch of this training setup appears after the table.) |
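
To make the simulated setting concrete, below is a minimal NumPy sketch of a softmax-gated MoE regression function with ridge experts sigmoid(a_i x + b_i) versus linear experts a_i x + b_i, together with the synthetic data generation quoted in the 'Open Datasets' row (X_i ~ Uniform[0, 1], Y_i = f_{G_*}(X_i) + ε_i). The number of experts, the true parameter values, and the noise scale are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_regression(x, beta1, beta0, a, b, expert="ridge"):
    """f_G(x) = sum_i softmax_i(beta1_i*x + beta0_i) * h(x, (a_i, b_i)),
    with ridge experts h = sigmoid(a_i*x + b_i) or linear experts h = a_i*x + b_i."""
    gate = softmax(np.outer(x, beta1) + beta0)   # (n, k) gating weights
    pre = np.outer(x, a) + b                     # (n, k) expert pre-activations
    h = 1.0 / (1.0 + np.exp(-pre)) if expert == "ridge" else pre
    return (gate * h).sum(axis=1)

# Illustrative ground-truth parameters (k* = 2 experts); not the paper's values.
beta1_true = np.array([2.0, -1.0]); beta0_true = np.array([0.5, 0.0])
a_true     = np.array([1.5, -2.0]); b_true     = np.array([-0.5, 1.0])

# Synthetic data as quoted above: X_i ~ Uniform[0, 1], Y_i = f_{G*}(X_i) + eps_i.
n = 10_000
X = rng.uniform(0.0, 1.0, size=n)
noise = rng.normal(0.0, 0.1, size=n)             # noise scale is an assumption
Y = moe_regression(X, beta1_true, beta0_true, a_true, b_true, expert="ridge") + noise
```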
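
A corresponding sketch of the 'Experiment Setup' row, assuming plain mini-batch SGD in PyTorch on the mean square loss, with each parameter initialized near its true counterpart. The learning rate, batch size, number of steps, noise level, and true parameter values are assumptions for illustration; the paper sweeps 20 sample sizes between 10^4 and 10^5 and repeats each configuration 20 times.

```python
import torch

torch.manual_seed(0)

def moe_forward(x, beta1, beta0, a, b, ridge=True):
    """Softmax-gated MoE: f(x) = sum_i softmax_i(beta1_i*x + beta0_i) * h(x, (a_i, b_i))."""
    gate = torch.softmax(x[:, None] * beta1 + beta0, dim=1)  # (n, k) gating weights
    pre = x[:, None] * a + b                                  # (n, k) expert pre-activations
    experts = torch.sigmoid(pre) if ridge else pre            # ridge vs. linear experts
    return (gate * experts).sum(dim=1)

# Illustrative true parameters (k* = 2 experts); values are assumptions, not the paper's.
true = {"beta1": torch.tensor([2.0, -1.0]), "beta0": torch.tensor([0.5, 0.0]),
        "a": torch.tensor([1.5, -2.0]), "b": torch.tensor([-0.5, 1.0])}

for n in (10_000, 100_000):                    # the paper sweeps 20 sizes in [1e4, 1e5]
    X = torch.rand(n)                          # X_i ~ Uniform[0, 1]
    Y = moe_forward(X, **true) + 0.1 * torch.randn(n)  # Y_i = f_{G*}(X_i) + eps_i

    # Initialize each parameter near its true counterpart (small Gaussian perturbation).
    params = {k: (v + 0.05 * torch.randn_like(v)).requires_grad_() for k, v in true.items()}
    opt = torch.optim.SGD(params.values(), lr=0.05)

    for _ in range(2_000):                     # SGD on the mean square loss
        idx = torch.randint(0, n, (256,))      # mini-batch of 256 points
        loss = ((moe_forward(X[idx], **params) - Y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(n, {k: [round(float(x), 3) for x in v.detach()] for k, v in params.items()})
```

In this sketch the recovered parameters can be compared against the true ones across sample sizes; the paper's simulation instead reports estimation errors averaged over 20 replications per configuration.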