Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Masked Gated Linear Unit

Authors: Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments on a variety of downstream NLP tasks, demonstrating that SwiMGLU achieves comparable or superior downstream accuracy to SwiGLU while notably improving inference throughput and memory efficiency, validating its practical effectiveness for resource-constrained LLM deployments.
Researcher Affiliation	Collaboration	Yukito Tajima¹, Nakamasa Inoue¹, Yusuke Sekikawa², Ikuro Sato¹,², Rio Yokota¹; ¹Institute of Science Tokyo, Japan; ²Denso IT Laboratory, Japan
Pseudocode	Yes	Algorithm 1: FlashMGLU forward pass: Split-K Matrix Vector Product with Packed n_m Masks; and Algorithm 3: Simplified CUDA Implementation of MGLU (n_m = N_MASKS).
Open Source Code	No	Answer: [No] Justification: The code will be released with a permissive license in the near future.
Open Datasets	Yes	We pre-train both baseline and SwiMGLU models on the FineWeb-Edu 100B dataset (Penedo et al., 2024)
Dataset Splits	No	We pre-train both baseline and SwiMGLU models on the FineWeb-Edu 100B dataset (Penedo et al., 2024) with small models being trained on a 10B token subset. For downstream evaluation, we report zero-shot and two-shot accuracy on six standard benchmarks: ARC Easy (Arc E) (Clark et al., 2018), ARC Challenge (Arc C) (Clark et al., 2018), HellaSwag (HS) (Zellers et al., 2019), PiQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), and Winogrande (WG) (Sakaguchi et al., 2021). We utilize the LM Evaluation Harness (Gao et al., 2024) for standardized performance evaluation. No explicit split percentages or counts are provided for any dataset.
Hardware Specification	Yes	All models are trained on TSUBAME supercomputer with NVIDIA H100 GPUs (94GB), with small models trained on 4 GPUs and large models trained on 16 GPUs with Fully Sharded Data Parallel (FSDP). Figure 4 shows the wall-clock latency of a single projection layer in a single batch setting measured on an RTX 5090 at FP16 precision.
Software Dependencies	No	All experiments reported in this paper are implemented based on the llm-recipes framework (Fujii et al., 2024). No specific version number for llm-recipes or other software is provided, only the framework name and its citation date.
Experiment Setup	Yes	Table 7 lists the hyperparameters that we use by default at training time for all our experiments.