Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Authors: Shuqing Luo, Ye Han, Pingzhi Li, Jiayin Qin, Jie Peng, Yang Zhao, Yu Cao, Tianlong Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.
Researcher Affiliation | Academia | 1 University of North Carolina at Chapel Hill, 2 University of Minnesota Twin Cities
Pseudocode | Yes | Algorithm 1 Expert Clustering.
Open Source Code | Yes | We provide all code and data necessary to reproduce every experimental result that we describe in this paper. Please find the code base for this paper here: https://github.com/UNITES-Lab/Mozart
Open Datasets | Yes | We use Alpaca [40], an instruction tuning dataset of 52K samples, for all our experiments. [40] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Dataset Splits | No | The paper mentions using "Alpaca [40], an instruction tuning dataset of 52K samples" but does not specify how this dataset is split into training, validation, or test sets for the experiments conducted. It describes processing in terms of "32 samples (sequences), divided into 4 serially executed micro-batches of size 8", but this refers to micro-batching during an experimental run, not a formal dataset split for reproducibility.
Hardware Specification | Yes | Our experiments include three MoE models with various architectures: Qwen3-30B-A3B [44, 45], OLMoE-1B-7B-0924 [28], and deepseek-moe-16b-base [4]... We use NVIDIA A100 80G GPU servers and PyTorch for our profiling and simulation experiments... The overall Mozart architecture comprises 16 expert-cluster chiplets for MoE computation, organized into 4 switch-connected clusters, as well as one dedicated attention chiplet. Each MoE/attention chiplet has 36 100 tiles, with 16 Systolic Arrays (SAs) in one tile and 256 576 Processing Elements (PEs) in one SA. Off-chip memory is provided by 6 HBM2-based DRAM [29]... targeting 28nm technology... under 1GHz clock frequency. Detailed configurations for the three models are summarized in Table 2.
Software Dependencies | No | We use NVIDIA A100 80G GPU servers and PyTorch for our profiling and simulation experiments. We implement the logic dies, SRAM dies, inter-chiplet interconnects and switches in Verilog, and synthesize the gate-level netlist using Synopsys Design Compiler [38] targeting 28nm technology. The typical power consumption is as reported by Synopsys PrimePower [39] based on the generated gate-level netlist. The paper mentions software names like PyTorch, Verilog, Synopsys Design Compiler, and Synopsys PrimePower but does not provide specific version numbers for these tools as required for reproducible software dependencies.
Experiment Setup | Yes | Our experiments include three MoE models with various architectures: Qwen3-30B-A3B [44, 45], OLMoE-1B-7B-0924 [28], and deepseek-moe-16b-base [4]... We use Alpaca [40], an instruction tuning dataset of 52K samples, for all our experiments... The system processes 32 samples (sequences), divided into 4 serially executed micro-batches of size 8. A weight-streaming strategy is adopted, where only one transformer block's weights are loaded at a time... We adjust hardware configurations for all three algorithmic baselines with FP16 precision... We simulate all the design under 1GHz clock frequency. The micro batch size for streaming attention/expert tokens set to 8.
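
The micro-batching arithmetic quoted in the Experiment Setup row (32 samples split into 4 serially executed micro-batches of size 8) can be illustrated with a minimal sketch. This is not taken from the Mozart code base: the helper name `micro_batches` and the integer stand-ins for sequences are assumptions made purely for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def micro_batches(batch: Sequence[T], micro_batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive micro-batches in order; the last one may be smaller.

    Hypothetical helper for illustration only, not the paper's implementation.
    """
    if micro_batch_size <= 0:
        raise ValueError("micro_batch_size must be positive")
    for start in range(0, len(batch), micro_batch_size):
        yield list(batch[start:start + micro_batch_size])


# Assumption: integers stand in for the 32 tokenized sequences of one step.
samples = list(range(32))

# Serial execution: each micro-batch is consumed before the next is produced.
chunks = []
for mb in micro_batches(samples, micro_batch_size=8):
    chunks.append(mb)

print(len(chunks), [len(c) for c in chunks])
```

As the Dataset Splits row notes, this partitions a single processing step only; it says nothing about how the 52K Alpaca samples would be divided into training, validation, or test sets.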