Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Zebra-Llama: Towards Extremely Efficient Hybrid Models

Authors: Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, Emad Barsoum

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The results of our zero-shot evaluations are summarized in Table 1. We compare our Zebra-Llama with the base Llama models and other baselines based on distillation: MambaInLLaMA (Hybrid Mamba2-GQA) [5], X-EcoMLA (Pure MLA) [12], Llamba (Pure Mamba2) [11], and Minitron (Pruning) [13]. In this section, we present a series of ablation studies aimed at justifying key design decisions in our approach. Specifically, we examine the impact of initialization strategies, the effectiveness of our SMART layer selection mechanism, the trade-offs between the number and size of MLA and Mamba2 layers, and the role of teacher model scaling.
Researcher Affiliation	Industry	Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, Emad Barsoum Advanced Micro Devices, Inc. (AMD) EMAIL
Pseudocode	Yes	Algorithm 1 Python-like pseudocode of the proposed SVD initialization for MLA. ... Algorithm 2 Pseudo code: SMART Structured MLA Layer Selection via Sensitivity Scores
Open Source Code	Yes	The source code is released at https://github.com/AMD-AGI/AMD-Hybrid-Models.
Open Datasets	Yes	For ILD and SFT, we use the same dataset as in [5] which includes multiple public datasets such as OpenHermes-2.5 [15], GenQA [16], and Infinity-Instruct [17], with a total number of 6.8 billion tokens. The dataset is split into 20% and 80% for ILD and SFT separately. We repeat the same training data more than one epoch to match the desired token budget. For DPO preference tuning, we adopt three datasets Llama3-ultrafeedback [18], orca_dpo_pairs [19], and ultrafeedback_binarized [20]. All models were trained on a single node equipped with eight AMD MI300 GPUs. Our training details are provided in Appendix A.4. Evaluation Tasks We adopt the LM Harness Eval benchmark [21] to perform zero-shot and few-shot evaluations on language understanding tasks, which includes ARC-Challenge (ARC) [22], ARC-Easy (ARE) [22], HellaSwag (HS) [23], MMLU (MM) [24], OpenBookQA (OB) [25], PIQA [26], RACE (RA) [27], and WinoGrande (WG) [28].
Dataset Splits	Yes	The dataset is split into 20% and 80% for ILD and SFT separately.
Hardware Specification	Yes	All models were trained on a single node equipped with eight AMD MI300 GPUs. ... All experiments are conducted on a single AMD MI300X GPU with 192GB memory.
Software Dependencies	No	We evaluate all models using the lm-evaluation-harness library (commit from the big-refactor branch) following the task-specific few-shot configurations defined by the Open LLM Leaderboard. For zero-shot evaluation, we report performance across a broad suite of language understanding tasks: MMLU, HellaSwag, PIQA, ARC-Easy, ARC-Challenge, Winogrande, OpenBookQA, and RACE. Evaluations are performed using the command-line interface with ROCm-enabled devices and a batch size of 16.
Experiment Setup	Yes	In Table 9, we present the training configurations for our Zebra-Llama series models, including the number of tokens, batch size, learning rate, and total training time. All experiments are conducted on a single node equipped with eight AMD MI300 GPUs, each featuring 192GB of memory. We apply a learning rate warmup over the first 1% of training data, followed by cosine annealing. The models are optimized using AdamW, with hyperparameters set to β = (0.9, 0.8). Additionally, all models process input sequences of length 2048 through sample packing.
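
The learning-rate schedule quoted in the Experiment Setup row (warmup over the first 1% of training, then cosine annealing) can be illustrated with a minimal sketch. This is not the authors' code: the function name `lr_at_step`, the linear shape of the warmup, the floor `min_lr`, and the example values of `total_steps` and `peak_lr` are all assumptions made for illustration, since the quoted text gives only the 1% warmup fraction and refers to Table 9 for the actual learning rates.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.01, min_lr=0.0):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumes a linear ramp during warmup; the source only says "warmup
    over the first 1% of training data", so the ramp shape is a guess.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine annealing from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative values only; the real settings are in the paper's Table 9.
    total_steps, peak_lr = 1000, 1e-4
    for s in (0, 9, 10, 505, 1000):
        print(s, lr_at_step(s, total_steps, peak_lr))
```

In a PyTorch training loop, a function of this shape could be wrapped in `torch.optim.lr_scheduler.LambdaLR` (returning the ratio to the base rate), with the quoted β values passed as `betas=(0.9, 0.8)` to `torch.optim.AdamW`.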