Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

Authors: Fanqi Yan, Huy Nguyen, Le Dung, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Lastly, in Section 4, we carry out several numerical experiments to empirically justify our theoretical results, and then conclude the paper in Section 5. In this section, we present several numerical experiments to verify our theoretical findings.
Researcher Affiliation Academia (1) Department of Computer Science, (2) Department of Statistics and Data Sciences, (3) Department of Electrical and Computer Engineering, The University of Texas at Austin
Pseudocode No No explicit pseudocode or algorithm blocks were found in the paper. The methodology is described through mathematical formulations and textual explanations.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We use synthetic data for our experiments. We will consider releasing the code upon the acceptance of our work.
Open Datasets No Synthetic data generation. We create synthetic datasets following the model outlined in equation (1). Specifically, we generate data pairs $\{(X_i, Y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$ by first drawing each covariate $X_i$ independently from a standard Gaussian distribution, for $i = 1, \ldots, n$, and consistently set $d = 8$ across all trials.
Dataset Splits No Problem setting. Suppose that $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$ are i.i.d. samples of covariate-response pairs of size $n$. Synthetic data generation. We create synthetic datasets following the model outlined in equation (1). Specifically, we generate data pairs $\{(X_i, Y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$ by first drawing each covariate $X_i$ independently from a standard Gaussian distribution, for $i = 1, \ldots, n$, and consistently set $d = 8$ across all trials.
Hardware Specification Yes All the numerical experiments are performed on a MacBook Air with an Apple M4 chip.
Software Dependencies No We use an EM algorithm [16] to compute the MLE, employing an off-the-shelf BFGS optimizer for the M-step due to the absence of a universal closed-form solution.
Experiment Setup Yes Experimental setup. Recall that, in the distinguishable setting, the pre-trained model $f_0$ does not belong to the Gaussian density family. Thus, we let $f_0$ be the density of a Laplace distribution, with mean function $h_0(x, \eta_0) = \tanh(\eta_0^{\top} x)$ and variance $\nu_0$. Here, $\eta_0$ is a $d$-dimensional vector defined as $e_1 := (1, 0, \ldots, 0)$, and $\nu_0 = 0.001$. Meanwhile, the prompt $f$ is formulated as a Gaussian density, with the same tanh mean function but a different parameter $\eta^*$, i.e., $h(x, \eta^*) = \tanh((\eta^*)^{\top} x)$ and variance $\nu^*$. On the other hand, in the non-distinguishable setting, both $f$ and $f_0$ belong to the Gaussian density family, and $h$ and $h_0$ are expert functions of the same form (albeit parameterized by different values of $\eta_0$ and $\eta^*$). As in the previous case, we let the expert function be the tanh function: in the pre-trained model, the expert is $h(x, \eta_0) = \tanh(\eta_0^{\top} x)$, and in the prompt model, it is $h(x, \eta^*) = \tanh((\eta^*)^{\top} x)$. Synthetic data generation. We create synthetic datasets following the model outlined in equation (1). Specifically, we generate data pairs $\{(X_i, Y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$ by first drawing each covariate $X_i$ independently from a standard Gaussian distribution, for $i = 1, \ldots, n$, and consistently set $d = 8$ across all trials. The responses $Y_i$ are drawn from the density $p_{G_*}(y \mid x)$, where $G_* = (\beta^*, \tau^*, \eta^*, \nu^*)$: (a) In the distinguishable setting, we let $\beta^* = \frac{1}{\sqrt{d}} \mathbf{1}_d$, $\tau^* = 1$, $\eta^* = e_1 = \eta_0$ and $\nu^* = \nu_0 = 0.001$. (b) In the non-distinguishable setting, we examine two cases to study the MLE convergence behavior as either $\eta^*$ or $\nu^*$ varies with $n$: in the first, $\eta^*$ is an $O(n^{-1/8})$ perturbation of $\eta_0$ with $\nu^*$ fixed at $\nu_0$; in the second, $\eta^* = \eta_0$ while $\nu^*$ is perturbed around $\nu_0$ at the same rate. In detail, we set: (i) In the first case, $\beta^* = \frac{1}{\sqrt{d}} \mathbf{1}_d$, $\tau^* = 1$, $\eta^* = e_1 (1 + n^{-1/8}) = \eta_0 (1 + n^{-1/8})$, and $\nu^* = \nu_0 = 0.001$. (ii) In the second case, $\beta^* = \frac{1}{\sqrt{d}} \mathbf{1}_d$, $\tau^* = 1$, $\eta^* = e_1 = \eta_0$, and $\nu^* = 0.001 (1 + n^{-1/8}) = \nu_0 (1 + n^{-1/8})$. Training procedure. We conduct 40 experiments and, for each of them, consider 20 different sample sizes $n$, ranging from $10^3$ to $10^5$. In computing the MLEs, the initialization is set relatively close to the true parameter values to mitigate potential optimization instabilities. We use an EM algorithm [16] to compute the MLE, employing an off-the-shelf BFGS optimizer for the M-step due to the absence of a universal closed-form solution.
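
Because the authors' code is not released, the synthetic data generation quoted in the Experiment Setup row can only be sketched. The following is a minimal, hypothetical NumPy illustration, not the authors' implementation. Equation (1) is not reproduced on this page, so the mixture form is an assumption: each response comes from the prompt expert with probability sigmoid(beta^T x + tau) and from the pre-trained expert otherwise. The function name `generate_data`, the `setting` labels, and the seed handling are likewise illustrative choices.

```python
import numpy as np


def generate_data(n, d=8, setting="eta_perturbed", seed=0):
    """Draw n pairs (X_i, Y_i) from an ASSUMED softmax-contaminated mixture.

    setting is one of "distinguishable", "eta_perturbed", "nu_perturbed"
    (hypothetical labels for cases (a), (b)(i) and (b)(ii) in the quoted setup).
    """
    rng = np.random.default_rng(seed)

    # Values quoted in the setup: eta_0 = e_1, nu_0 = 0.001,
    # beta = 1_d / sqrt(d), tau = 1.
    eta0 = np.zeros(d)
    eta0[0] = 1.0
    nu0 = 1e-3
    beta = np.ones(d) / np.sqrt(d)
    tau = 1.0
    rate = n ** (-1.0 / 8.0)  # perturbation size n^{-1/8}

    if setting == "distinguishable":
        eta, nu = eta0, nu0
    elif setting == "eta_perturbed":
        eta, nu = eta0 * (1.0 + rate), nu0
    elif setting == "nu_perturbed":
        eta, nu = eta0, nu0 * (1.0 + rate)
    else:
        raise ValueError(f"unknown setting: {setting!r}")

    # Covariates: i.i.d. standard Gaussian vectors in R^d.
    X = rng.standard_normal((n, d))

    # ASSUMPTION: mixing weight of the prompt expert is sigmoid(beta^T x + tau).
    gate = 1.0 / (1.0 + np.exp(-(X @ beta + tau)))
    from_prompt = rng.random(n) < gate

    mean0 = np.tanh(X @ eta0)  # pre-trained expert h_0(x, eta_0)
    mean1 = np.tanh(X @ eta)   # prompt expert h(x, eta)

    if setting == "distinguishable":
        # A Laplace distribution with variance nu0 has scale sqrt(nu0 / 2).
        y0 = rng.laplace(mean0, np.sqrt(nu0 / 2.0))
    else:
        y0 = rng.normal(mean0, np.sqrt(nu0))
    y1 = rng.normal(mean1, np.sqrt(nu))

    Y = np.where(from_prompt, y1, y0)
    return X, Y, from_prompt
```

A call such as `X, Y, z = generate_data(1000, setting="nu_perturbed")` returns covariates of shape `(1000, 8)`, responses of shape `(1000,)`, and the latent component indicator. The quoted setup gives 20 sample sizes between 10^3 and 10^5 but does not state their spacing, so any particular grid (for example, a log-spaced one) would be a further assumption.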