Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup

Authors: Fanxu Meng, Pingzhi Tang, Zengwei Yao, Xing Sun, Muhan Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper demonstrates that MLA provides superior expressive power compared to GQA with the same KV cache overhead, thereby offering a rationale for transitioning from GQA to MLA. In addition, we introduce TransMLA, a framework that seamlessly converts any GQA-based pre-trained model (e.g., LLaMA, Qwen, Gemma, Mistral/Mixtral) into an MLA-based model. For the first time, our method enables direct conversion of these models into a format compatible with DeepSeek's codebase, allowing them to fully leverage the existing, highly-optimized support for the DeepSeek architecture within inference engines like vLLM and SGLang. By compressing 93% of the KV cache in LLaMA2-7B, we achieve a 10x speedup with an 8K context length while maintaining meaningful output. Moreover, the model requires only 6B tokens for fine-tuning to recover comparable performance across multiple benchmarks.
Researcher Affiliation | Collaboration | Fanxu Meng (1), Pingzhi Tang (1), Zengwei Yao (4), Xing Sun (3), Muhan Zhang (1, 2). (1) Institute for Artificial Intelligence, Peking University; (2) State Key Laboratory of General Artificial Intelligence, BIGAI; (3) Tencent Youtu Lab, Shanghai, China; (4) Xiaomi Corp., Beijing, China
Pseudocode | No | The paper describes methods using mathematical equations and prose (e.g., Section 3, Section 4), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | https://github.com/MuLabPKU/TransMLA
Open Datasets | Yes | For the training process, we used a subset of the pretraining corpus from SmolLM [40]. The dataset comprises FineWeb-Edu-Dedup [41], Cosmopedia-v2 (a synthetic dataset generated by Mixtral [42]), Python-Edu from StarCoder [43], OpenWebMath [44], and data from Stack Overflow [45].
Dataset Splits | No | The paper mentions using a 'subset of the pretraining corpus' for fine-tuning, 'a small calibration dataset (e.g., Wikitext-2)', and 'distillation datasets containing 14,000 samples, with each sample consisting of 2,048 tokens'. While it describes data used for different purposes and some characteristics like 'data composition strategy' (Table 2), it does not provide explicit train/test/validation splits (e.g., percentages or specific counts for all primary experiments) for the datasets used in the main evaluation.
Hardware Specification | Yes | Our experiments were conducted on an 8-GPU machine, each GPU having 40GB of memory and delivering 312 TFLOPS of FP16 compute power. In Figure 5, we benchmarked the inference performance of an MLA model with a 92.97% reduction in KV cache size on three consumer-grade AI accelerators with different compute capabilities and memory sizes: 165.2 TFLOPS with 24GB memory, 312 TFLOPS with 40GB memory, and 320 TFLOPS with 64GB memory.
Software Dependencies | No | The paper mentions 'inference engines like vLLM and SGLang' and 'the vLLM framework' without specifying any version numbers for these or other software dependencies.
Experiment Setup | Yes | Detailed hyperparameter settings are provided in the Appendix G. Table 3: Training details across different models. (lists Batch size, Learning rate, Tokens, Warmup ratio, lr scheduler, Sequence length)
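To give a sense of scale for the KV cache figures quoted in the Research Type and Hardware Specification entries, the following is a rough back-of-the-envelope sketch, not a calculation taken from the paper. It assumes the publicly documented LLaMA2-7B shape (32 layers, 32 key/value heads, head dimension 128), FP16 storage, a batch of one sequence, and that "8K" means 8192 tokens. The per-layer latent width of 576 values (512 + 64, as in DeepSeek-style MLA) is a hypothetical configuration chosen only because it reproduces the quoted 92.97% figure; the record above does not state the actual setting.

```python
def kv_cache_bytes(num_layers, values_per_token_per_layer, context_len, bytes_per_value=2):
    """Size of the KV cache for one sequence, in bytes (FP16 by default)."""
    return num_layers * values_per_token_per_layer * context_len * bytes_per_value


# Assumed LLaMA2-7B shape (public model config, not taken from the summary above).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
CONTEXT_LEN = 8192  # assumption: "8K context length" read as 8192 tokens

# Keys and values are both cached, hence the factor of 2.
baseline_width = 2 * NUM_KV_HEADS * HEAD_DIM

# Reduction quoted in the Hardware Specification entry.
QUOTED_REDUCTION = 0.9297

baseline = kv_cache_bytes(NUM_LAYERS, baseline_width, CONTEXT_LEN)
compressed = baseline * (1 - QUOTED_REDUCTION)

# Hypothetical MLA cache width per token per layer: a 512-value latent plus
# 64 positional values, as in DeepSeek-style MLA. Illustrative assumption only.
latent_width = 512 + 64
implied_reduction = 1 - latent_width / baseline_width

GIB = 1024 ** 3
print(f"baseline KV cache:   {baseline / GIB:.2f} GiB")
print(f"compressed KV cache: {compressed / GIB:.2f} GiB")
print(f"implied reduction:   {implied_reduction * 100:.2f}%")
```

Under these assumptions the uncompressed cache for a single 8192-token sequence is 4 GiB, so the quoted reduction leaves well under 0.3 GiB per sequence, which helps explain why the reported speedup grows with context length and with tighter accelerator memory.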