Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Masked Gated Linear Unit
Authors: Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on a variety of downstream NLP tasks, demonstrating that SwiMGLU achieves comparable or superior downstream accuracy to SwiGLU while notably improving inference throughput and memory efficiency, validating its practical effectiveness for resource-constrained LLM deployments. |
| Researcher Affiliation | Collaboration | Yukito Tajima1 Nakamasa Inoue1 Yusuke Sekikawa2 Ikuro Sato1,2 Rio Yokota1 1Institute of Science Tokyo, Japan 2Denso IT Laboratory, Japan |
| Pseudocode | Yes | Algorithm 1 FlashMGLU forward pass: Split-K Matrix Vector Product with Packed n_m Masks. and Algorithm 3 Simplified CUDA Implementation of MGLU (n_m = N_MASKS). |
| Open Source Code | No | Answer: [No] Justification: The code will be released with a permissive license in the near future. |
| Open Datasets | Yes | We pre-train both baseline and SwiMGLU models on the FineWeb-Edu 100B dataset (Penedo et al., 2024) |
| Dataset Splits | No | We pre-train both baseline and SwiMGLU models on the FineWeb-Edu 100B dataset (Penedo et al., 2024) with small models being trained on a 10B token subset. For downstream evaluation, we report zero-shot and two-shot accuracy on six standard benchmarks: ARC Easy (Arc E) (Clark et al., 2018), ARC Challenge (Arc C) (Clark et al., 2018), HellaSwag (HS) (Zellers et al., 2019), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), and Winogrande (WG) (Sakaguchi et al., 2021). We utilize the LM Evaluation Harness (Gao et al., 2024) for standardized performance evaluation. No explicit split percentages or counts are provided for any dataset. |
| Hardware Specification | Yes | All models are trained on TSUBAME supercomputer with NVIDIA H100 GPUs (94GB), with small models trained on 4 GPUs and large models trained on 16 GPUs with Fully Sharded Data Parallel (FSDP). Figure 4 shows the wall-clock latency of a single projection layer in a single batch setting measured on an RTX 5090 at FP16 precision. |
| Software Dependencies | No | All experiments reported in this paper are implemented based on the llm-recipes framework (Fujii et al., 2024). No specific version number for llm-recipes or other software is provided, only the framework name and its citation date. |
| Experiment Setup | Yes | Table 7 lists the hyperparameters that we use by default at training time for all our experiments. |