Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks
Authors: Mingze Wang, Weinan E
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To support our main theoretical results, we conduct two new experiments, each aligned with one of our key insights. The experimental details are shown in Appendix C. |
| Researcher Affiliation | Academia | Mingze Wang School of Mathematical Sciences, Peking University, Beijing, China EMAIL Weinan E Center for Machine Learning Research and School of Mathematical Sciences, Peking University, Beijing, China AI for Science Institute, Beijing, China EMAIL |
| Pseudocode | No | The paper describes theoretical concepts and proofs, and includes experimental validation. However, it does not contain any explicitly labeled pseudocode or algorithm blocks. The methods are described in prose and mathematical formulations. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code or data of the experiments are simple and easy to reproduce following the description in the paper. |
| Open Datasets | No | Specifically, we consider the low-dimensional manifold $\mathcal{M} = \{x \in \mathbb{R}^D : x_1^2 + x_2^2 = 1;\ x_i = 0,\ i > 2\}$ embedded in $\mathbb{R}^D$ with $D > 2$. The target function is $f(x) = \sin(5x_1) + \cos(3x_2)$, defined on $\mathcal{M}$. ... As defined in our Figure 3, we consider the piecewise function $f$ with compositional sparsity defined over $3^2 = 9$ unit cubes. |
| Dataset Splits | No | The experiments use mathematically defined functions and manifolds for evaluation rather than external datasets with predefined splits. There is no mention of train/test/validation splits for any dataset. |
| Hardware Specification | Yes | The experiments in Section 6 are conducted on 1 A100 GPU. |
| Software Dependencies | No | In Experiment I, the models are trained for 2,000 iterations with batch size 128 (online), using squared loss and Adam optimizer with learning rate 1e-3. In Experiment II, the models are trained for 5,000 iterations with batch size 128 (online), using squared loss and Adam optimizer with learning rate 1e-3. While Adam optimizer and squared loss are mentioned, no specific software library versions (e.g., PyTorch, TensorFlow) or Python versions are provided. |
| Experiment Setup | Yes | In Experiment I, the models are trained for 2,000 iterations with batch size 128 (online), using squared loss and Adam optimizer with learning rate 1e-3. ... As a model, we consider 1-4-MoE, a 1-layer MoE comprising 1 router and 4 experts, where each expert is a two-layer ReLU network with hidden width 10. ... In Experiment II, the models are trained for 5,000 iterations with batch size 128 (online), using squared loss and Adam optimizer with learning rate 1e-3. ... As the model, we consider 2-3-MoE (a 2-layer MoE comprising 2 routing layers and 2 expert layers with 3 experts each); ... Each expert is a two-layer ReLU FFN with hidden width $m \in \{16, 32, 64, 128\}$. |
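
Because no code is released, the following is a minimal sketch of the Experiment I setup as quoted in the table above, not the authors' implementation. It samples an online batch of 128 points from the circle manifold and evaluates the target f(x) = sin(5x_1) + cos(3x_2), then runs an untrained 1-4-MoE forward pass (1 router, 4 two-layer ReLU experts of hidden width 10). Several details are assumptions that the quoted text does not state: the ambient dimension `D = 16`, uniform sampling on the circle, a linear top-1 router, the Gaussian weight initialization, and every function name. The training loop (Adam, learning rate 1e-3, 2,000 iterations) is omitted.

```python
import math
import random


def sample_manifold_batch(batch_size, D, rng):
    """Sample points with x_1^2 + x_2^2 = 1 and x_i = 0 for i > 2.

    Uniform sampling of the angle is an assumption."""
    batch = []
    for _ in range(batch_size):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x = [0.0] * D
        x[0], x[1] = math.cos(theta), math.sin(theta)
        batch.append(x)
    return batch


def target(x):
    """Target function from the table: f(x) = sin(5 x_1) + cos(3 x_2)."""
    return math.sin(5.0 * x[0]) + math.cos(3.0 * x[1])


def make_expert(D, width, rng):
    """Two-layer ReLU network R^D -> R; the initialization is an assumption."""
    return {
        "W1": [[rng.gauss(0.0, 1.0 / math.sqrt(D)) for _ in range(D)]
               for _ in range(width)],
        "b1": [0.0] * width,
        "w2": [rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)],
        "b2": 0.0,
    }


def expert_forward(expert, x):
    """Apply the hidden ReLU layer, then the linear output layer."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(expert["W1"], expert["b1"])
    ]
    return sum(w * h for w, h in zip(expert["w2"], hidden)) + expert["b2"]


def moe_forward(router, experts, x):
    """Route x to the expert with the highest linear router score.

    Top-1 routing is an assumption; the routing rule is not quoted above."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    k = max(range(len(experts)), key=lambda i: scores[i])
    return expert_forward(experts[k], x), k


rng = random.Random(0)
D = 16            # assumption: the table only states D > 2
NUM_EXPERTS = 4   # "1-4-MoE": 1 router, 4 experts
HIDDEN_WIDTH = 10
BATCH_SIZE = 128

batch = sample_manifold_batch(BATCH_SIZE, D, rng)
targets = [target(x) for x in batch]

router = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(NUM_EXPERTS)]
experts = [make_expert(D, HIDDEN_WIDTH, rng) for _ in range(NUM_EXPERTS)]

outputs = [moe_forward(router, experts, x) for x in batch]
preds = [y for y, _ in outputs]
chosen = [k for _, k in outputs]

# Squared loss on one online batch, before any training.
mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / BATCH_SIZE
print(f"untrained batch MSE: {mse:.4f}")
```

With fresh batches drawn at every step, this matches the "online" batch description; a faithful reproduction would replace the random weights with parameters trained by Adam in a framework of the reader's choice.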