Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MVA: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition

Authors: Wang Ning, Zekun Li, Tongxin Bai, Man Yao, Zhen Qin, Guoqi Li

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we explore inheriting LLM weights and converting them into linear models. Specifically, we adopt the Mistral-7B model as the base LLM and evaluate the performance of MVA-SW and MVA. We use the lm-evaluation-harness (Gao et al., 2024) tool to perform the test.
Researcher Affiliation | Collaboration | 1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Beijing Academy of Artificial Intelligence; 4 Tap Tap.
Pseudocode | No | The paper describes the methodology using mathematical equations and textual descriptions, but there are no explicitly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not provide an explicit statement about the release of source code for the described methodology, nor does it include any links to code repositories.
Open Datasets | Yes | The dataset used is the SlimPajama (Soboleva et al., 2023) corpus. The results are shown in Table 1 and Table 2. ... SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Dataset Splits | No | The paper mentions using the SlimPajama corpus, and the Alpaca-Clean and RedPajama datasets, for fine-tuning and evaluation on lm-evaluation-harness tasks such as MMLU (5-shot). However, it does not provide specific details on how these datasets were split into training, validation, and test sets for the fine-tuning experiments, beyond mentioning training length in tokens and batch size.
Hardware Specification | No | The paper mentions 'GPU memory constraints' and provides memory usage in MiB in Table 9, but does not specify the exact GPU model, CPU model, or other detailed hardware specifications used for experiments.
Software Dependencies | No | We use the lm-evaluation-harness (Gao et al., 2024) tool to perform the test. For fine-tuning, we utilize LoRA (Hu et al., 2021) to achieve efficient fine-tuning. These tools are mentioned but without specific version numbers.
Experiment Setup | Yes | For fine-tuning, we utilize LoRA with the QKV mapping and FFN down_proj parameters, setting the rank to 128; alternatively, tuning only the QKV mapping with a rank of 8. ... Optimization is performed using AdamW with a cosine learning rate schedule, an initial learning rate of 4×10⁻⁵, 20 steps of linear warmup, and a training length of 1.5K due to GPU memory constraints, with a batch size of 0.1M tokens.
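The reported schedule (cosine decay, initial learning rate 4×10⁻⁵, 20 steps of linear warmup) can be sketched in plain Python. This is a minimal illustration, not the authors' code; `total_steps=1500` is an assumption that reads "training length of 1.5K" as a step count, which the report leaves ambiguous.

```python
import math

def lr_at_step(step, base_lr=4e-5, warmup_steps=20, total_steps=1500):
    """Cosine learning-rate schedule with linear warmup.

    Hyperparameters mirror those quoted above; total_steps is an
    assumed reading of "training length of 1.5K".
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to the base learning rate.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks at exactly the base learning rate when warmup ends and decays smoothly to zero at the final step, matching the common AdamW + cosine recipe the paper describes.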