Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup

Authors: Fanxu Meng, Pingzhi Tang, Zengwei Yao, Xing Sun, Muhan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This paper demonstrates that MLA provides superior expressive power compared to GQA with the same KV cache overhead, thereby offering a rationale for transitioning from GQA to MLA. In addition, we introduce TransMLA, a framework that seamlessly converts any GQA-based pre-trained model (e.g., LLaMA, Qwen, Gemma, Mistral/Mixtral) into an MLA-based model. For the first time, our method enables direct conversion of these models into a format compatible with DeepSeek's codebase, allowing them to fully leverage the existing, highly-optimized support for the DeepSeek architecture within inference engines like vLLM and SGlang. By compressing 93% of the KV cache in LLaMA2-7B, we achieve a 10x speedup with an 8K context length while maintaining meaningful output. Moreover, the model requires only 6B tokens for fine-tuning to recover comparable performance across multiple benchmarks.
Researcher Affiliation	Collaboration	Fanxu Meng¹, Pingzhi Tang¹, Zengwei Yao⁴, Xing Sun³, Muhan Zhang¹,². ¹Institute for Artificial Intelligence, Peking University; ²State Key Laboratory of General Artificial Intelligence, BIGAI; ³Tencent Youtu Lab, Shanghai, China; ⁴Xiaomi Corp., Beijing, China
Pseudocode	No	The paper describes methods using mathematical equations and prose (e.g., Section 3, Section 4), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code	Yes	https://github.com/MuLabPKU/TransMLA
Open Datasets	Yes	For the training process, we used a subset of the pretraining corpus from SmolLM [40]. The dataset comprises FineWeb-Edu-Dedup [41], Cosmopedia-v2, a synthetic dataset generated by Mixtral [42], Python-Edu from StarCoder [43], Open-WebMath [44], and data from Stack Overflow [45].
Dataset Splits	No	The paper mentions using a 'subset of the pretraining corpus' for fine-tuning, 'a small calibration dataset (e.g., Wikitext-2)', and 'distillation datasets containing 14,000 samples, with each sample consisting of 2,048 tokens'. While it describes data used for different purposes and some characteristics like 'data composition strategy' (Table 2), it does not provide explicit train/test/validation splits (e.g., percentages or specific counts for all primary experiments) for the datasets used in the main evaluation.
Hardware Specification	Yes	Our experiments were conducted on an 8-GPU machine, each GPU having 40GB of memory and delivering 312 TFLOPS of FP16 compute power. In Figure 5, we benchmarked the inference performance of an MLA model with a 92.97% reduction in KV cache size on three consumer-grade AI accelerators with different compute capabilities and memory sizes: 165.2 TFLOPS with 24GB memory, 312 TFLOPS with 40GB memory, and 320 TFLOPS with 64GB memory.
Software Dependencies	No	The paper mentions 'inference engines like vLLM and SGlang' and 'the vLLM framework' without specifying any version numbers for these or other software dependencies.
Experiment Setup	Yes	Detailed hyperparameter settings are provided in the Appendix G. Table 3: Training details across different models. (lists Batch size, Learning rate, Tokens, Warmup ratio, lr scheduler, Sequence length)
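
The 92.97% KV cache reduction quoted in the Hardware Specification row can be sanity-checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes the publicly known LLaMA-2-7B attention shape (32 key/value heads of dimension 128) and assumes the compressed cache uses a DeepSeek-style MLA layout with a 512-dimensional latent plus a 64-dimensional RoPE key per token and layer. All function names are hypothetical.

```python
def mha_cache_elems(num_kv_heads: int, head_dim: int) -> int:
    """Cached elements per token per layer for standard attention (keys + values)."""
    return 2 * num_kv_heads * head_dim


def mla_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """Cached elements per token per layer for MLA (shared latent + RoPE key)."""
    return latent_dim + rope_dim


def reduction_percent(baseline: int, compressed: int) -> float:
    """Percentage of the baseline cache removed by compression."""
    return 100.0 * (1.0 - compressed / baseline)


# Assumption: LLaMA-2-7B uses 32 KV heads with head dimension 128.
baseline = mha_cache_elems(num_kv_heads=32, head_dim=128)

# Assumption: DeepSeek-style MLA cache of 512 latent + 64 RoPE dimensions.
compressed = mla_cache_elems(latent_dim=512, rope_dim=64)

reduction = reduction_percent(baseline, compressed)
print(f"{baseline} -> {compressed} elements per token per layer "
      f"({reduction:.2f}% reduction)")
```

Under these assumed dimensions the reduction rounds to 92.97%, which matches the figure reported in the table; other latent sizes would give a different percentage.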