Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Mixtures of Experts with EM: A Mirror Descent Perspective
Authors: Quentin Fruytier, Aryan Mokhtari, Sujay Sanghavi
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for MoE where the conditional distribution of the target and latent variable conditioned on the feature variable belongs to an exponential family of distributions and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence; in the special case of a mixture of 2 linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data support that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy. |
| Researcher Affiliation | Academia | ¹Electrical and Computer Engineering Department, University of Texas at Austin, Austin, Texas, United States of America. Correspondence to: Quentin Fruytier <EMAIL>, Aryan Mokhtari <EMAIL>, Sujay Sanghavi <EMAIL>. |
| Pseudocode | Yes | Appendix D (Algorithms): D.1) Algorithm 1, "EM for MoE"; D.2) Algorithm 2, "Gradient EM for MoE"; D.3) Algorithm 3, "Gradient Descent for MoE". |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. |
| Open Datasets | Yes | We evaluate these methods on both a synthetic dataset and the real-world Fashion-MNIST dataset, consistently reporting significant improvements for EM and Gradient EM over GD. We also provide mini-batch CIFAR-10 experiments with more than 2 experts in Appendix C.1. Validation Experiment on Fashion-MNIST. For the small scale proof of concept experiment on Fashion-MNIST (Xiao et al., 2017), we alter the dataset so as to simulate a mixture of 2 Logistic Experts. |
| Dataset Splits | No | The paper mentions using a "test set" for Fashion MNIST and CIFAR-10 but does not provide specific details on how the datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or explicit splitting methodology). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU, GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We sampled 10^3 data points from a SymMoLinE with known additive unit Gaussian noise (i.e. N(0, 1)) and true parameters β, w ∈ ℝ^10 that satisfy ‖β‖₂ = ‖w‖₂ = 4. Subsequently, we run full-batch EM, Gradient EM, and GD for 50 iterations and report the results on the training set averaged over 50 instances, each time re-sampling the true parameters, initial parameters, and whole dataset. The initial parameters are randomly initialized within a neighborhood of the true parameters and are consistent across all benchmarks. ... The 2-component MoLogE to be trained consists of one linear gating layer of dimension 2 × 28 × 28, and 2 logistic experts of dimension 10 × 28 × 28 each. We randomly initialize each linear layer to be unit norm and execute the algorithms on the same datasets and with the same initializations. For Gradient EM, the only additional code needed over GD is to define the EM loss function appropriately, and then perform a gradient step on the gating parameters and the expert parameters separately, as described in Algorithm 2. For EM, at each iteration we perform several gradient steps in an inner loop to approximately recover the solutions to the sub-problems described in (9). We report our findings for the full-batch iteration of the respective algorithms in Table 2 and Figure 3. ... after 100 iterations of EM, Gradient EM and GD... a 5-component MoE... individual experts that each consist of a single hidden layer MLP with hidden dimension 100 and ReLU activation. The gating function also consists of a single hidden layer MLP with hidden dimension 100 and ReLU activation. We randomly initialize each linear layer to have rows that are unit-norm. |
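The abstract's central claim, that EM for exponential-family MoE is equivalent to projected Mirror Descent with unit step size and a KL regularizer, has the following schematic form. The notation here (θ for the MoE parameters, 𝓛 for the negative log-likelihood objective, Θ for the feasible set) is generic and not taken verbatim from the paper:

```latex
\theta^{t+1} \;=\; \operatorname*{arg\,min}_{\theta \in \Theta}
  \Big\{ \big\langle \nabla \mathcal{L}(\theta^{t}),\, \theta \big\rangle
  \;+\; D_{\mathrm{KL}}\!\big( p_{\theta^{t}} \,\big\|\, p_{\theta} \big) \Big\}
```

A unit step size means the KL term enters with coefficient 1; the "projected" qualifier refers to the arg min being taken over the constraint set Θ.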
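The synthetic setup quoted above (a symmetric 2-component mixture of linear experts, unit Gaussian noise, parameters of norm 4 in ℝ^10, full-batch EM from an initialization near the truth) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the closed-form M-step for the expert and the inner gradient steps for the gate are standard choices for this model, and the variable names, step size, and inner-loop length are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# --- synthetic data: symmetric 2-expert mixture of linear experts ---
n, d = 1000, 10
beta_star = rng.normal(size=d); beta_star *= 4.0 / np.linalg.norm(beta_star)
w_star = rng.normal(size=d); w_star *= 4.0 / np.linalg.norm(w_star)

X = rng.normal(size=(n, d))
z = rng.random(n) < sigmoid(X @ w_star)                 # latent expert choice
y = np.where(z, X @ beta_star, -X @ beta_star) + rng.normal(size=n)

# --- full-batch EM, initialized in a neighborhood of the truth ---
beta = beta_star + 0.5 * rng.normal(size=d)
w = w_star + 0.5 * rng.normal(size=d)

for _ in range(50):
    # E-step: posterior responsibility of the +beta expert (unit noise variance)
    m = X @ beta
    g = sigmoid(X @ w)
    p_pos = g * np.exp(-0.5 * (y - m) ** 2)
    p_neg = (1.0 - g) * np.exp(-0.5 * (y + m) ** 2)
    r = p_pos / (p_pos + p_neg)

    # M-step for the expert: the weighted least-squares sub-problem
    # has a closed form in the symmetric 2-expert case
    beta = np.linalg.solve(X.T @ X, X.T @ ((2.0 * r - 1.0) * y))

    # M-step for the gate: a few gradient steps on the logistic sub-problem
    # (the paper's EM likewise solves its sub-problems approximately)
    for _ in range(10):
        w += 0.1 / n * X.T @ (r - sigmoid(X @ w))

# error up to the mixture's sign symmetry
err = min(np.linalg.norm(beta - beta_star), np.linalg.norm(beta + beta_star))
```

With signal norm 4 against unit noise, the signal-to-noise ratio is high enough that EM started near the truth should contract toward it, which is the regime the paper's local linear-convergence guarantees address.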