Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
On Linear Mode Connectivity of Mixture-of-Experts Architectures
Authors: Viet-Hoang Tran, Van Hoan Trinh, Khanh-Vinh Bui, Tan M. Nguyen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we empirically validate the presence of LMC using our proposed algorithm across diverse MoE configurations including dense, sparse, and shared-expert variants under a wide range of model settings and datasets of varying scales and modalities. Our results confirm the existence of LMC in MoE architectures and offer fundamental insights into the functional landscape and optimization dynamics of deep learning models. |
| Researcher Affiliation | Academia | Viet-Hoang Tran Department of Mathematics National University of Singapore EMAIL Van-Hoan Trinh Department of Mathematics Technical University of Munich EMAIL Khanh Vinh Bui Independent Researcher Ho Chi Minh City, Vietnam EMAIL Tan M. Nguyen Department of Mathematics National University of Singapore EMAIL |
| Pseudocode | Yes | Algorithm 1: Weight Matching for Mixture-of-Experts. Input: MoE model weights $\phi = (W_i, b_i, \theta_i)_{i=1,\dots,n}$ and $\phi' = (W'_i, b'_i, \theta'_i)_{i=1,\dots,n}$. Output: permutation $\tau$ for experts, and permutations $\{P_i\}_{i=1}^{n}$ for hidden units. % Step 1: Match experts order using two methods. for method in {gate, expert} do: compute cost matrix $C$; solve LAP to obtain expert permutation $\tau_{\text{method}}$; end for. % The two candidate expert orderings $\tau_{\text{gate}}$ and $\tau_{\text{expert}}$ are obtained. % Step 2: Align internal weights of matched expert pairs. for method in {gate, expert} do: for $i = 1$ to $n$ do: compute $P_i$ by applying Weight Matching to $\theta_i$ and $\theta'_{\tau_{\text{method}}(i)}$; end for; end for. return $\tau_{\text{gate}}$, $(\{P_i\}_{i=1}^{n})_{\text{gate}}$, $\tau_{\text{expert}}$, $(\{P_i\}_{i=1}^{n})_{\text{expert}}$ |
| Open Source Code | Yes | The code is publicly available at https://github.com/MLResearchX/lmc-moe. |
| Open Datasets | Yes | For vision tasks, our study includes MNIST, CIFAR-10, CIFAR-100, ImageNet-1k, as well as transfer learning scenarios from ImageNet-21k to CIFAR-10 and CIFAR-100. For language modeling, we utilize WikiText103 and the One Billion Word dataset. |
| Dataset Splits | Yes | Experimental Design. We investigate LMC by replacing the Feedforward Network (FFN) in the Transformer [84] layer with a randomly initialized MoE, based on empirical evidence from Section 6.4, Appendices E, G.2, and G.3 indicating lower perturbation sensitivity when replacing deeper FFN layers with MoEs. Only the MoE parameters are fine-tuned using multiple random seeds for each experiment. LMC is evaluated by linearly interpolating between all checkpoint pairs, measuring model performance on the test set at 25 evenly spaced points along the interpolation... Datasets and Models. We use ViT [20] for image classification (MNIST [51], CIFAR-10/100 [47], ImageNet [18]) and GPT-2 [64] for language modeling (WikiText103 [59] and One Billion Word [14]). |
| Hardware Specification | Yes | All experiments are executed on a single NVIDIA H100 GPU with 80GB of memory, except for the One Billion Word task, which utilizes two H100 GPUs. |
| Software Dependencies | No | Due to the use of the JAX framework, approximately 75% of GPU memory (around 60GB) is pre-allocated by default. |
| Experiment Setup | Yes | Hyperparameters such as batch size, optimizer, number of experts, and hidden size are fixed, while the learning rate is tuned per setting. Full details are provided in Appendix F. |
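
The Pseudocode row quotes Algorithm 1 only in flattened form, so a small illustration of its two steps may help. The Python sketch below is not the authors' implementation. It assumes a squared Euclidean cost between gate vectors (the quoted text does not say how the cost matrix is built), covers only the "gate" matching method, represents each expert as a plain list of hidden-unit weight vectors, and replaces a real LAP solver with brute-force search that is practical only for toy sizes. All function and variable names are hypothetical.

```python
from itertools import permutations

def solve_lap(cost):
    """Exact linear assignment by brute force (toy sizes only).
    Returns perm where row i is assigned to column perm[i]."""
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))

def sq_dist(u, v):
    """Squared Euclidean distance (assumed cost, not taken from the paper)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pairwise_cost(rows_a, rows_b):
    """cost[i][j] = distance between vector i of model A and vector j of model B."""
    return [[sq_dist(a, b) for b in rows_b] for a in rows_a]

def weight_matching_moe(gates_a, experts_a, gates_b, experts_b):
    # Step 1 ("gate" method): match expert order via gate vectors.
    tau = solve_lap(pairwise_cost(gates_a, gates_b))
    # Step 2: align hidden units inside each matched expert pair.
    perms = [solve_lap(pairwise_cost(experts_a[i], experts_b[tau[i]]))
             for i in range(len(tau))]
    return tau, perms

# Toy check: model B is model A with experts reordered and
# the two hidden units of every expert swapped.
gates_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
experts_a = [[[1.0, 2.0], [3.0, 4.0]],
             [[5.0, 6.0], [7.0, 8.0]],
             [[9.0, 10.0], [11.0, 12.0]]]
order = [2, 0, 1]  # expert j of B is expert order[j] of A
gates_b = [gates_a[k] for k in order]
experts_b = [list(reversed(experts_a[k])) for k in order]

tau, perms = weight_matching_moe(gates_a, experts_a, gates_b, experts_b)
print(tau, perms)
```

In this toy case the matching recovers the reordering exactly because model B is a permuted copy of model A. For independently trained checkpoints the assignment only minimizes the chosen cost, and a polynomial-time solver would replace `solve_lap` at realistic sizes.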