Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Efficient Large Language Model Inference with Neural Block Linearization
Authors: Mete Erdogan, Francesco Tonin, Volkan Cevher
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs. |
| Researcher Affiliation | Academia | Mete Erdogan, Francesco Tonin, Volkan Cevher Laboratory for Information and Inference Systems École Polytechnique Fédérale de Lausanne (EPFL), Switzerland |
| Pseudocode | Yes | Algorithm 1 Neural Block Linearization (NBL) |
| Open Source Code | Yes | The implementation is available at: https://github.com/LIONS-EPFL/NBL. |
| Open Datasets | Yes | Evaluation benchmarks include 5-shot performance on the MMLU task [Hendrycks et al., 2020] and 0-shot performance on ARC-easy (ARC-e), ARC-challenge (ARC-c) [Clark et al., 2018], BoolQ [Clark et al., 2019], HellaSwag [Zellers et al., 2019], OBQA [Mihaylov et al., 2018], PIQA [Bisk et al., 2020] and WinoGrande [Sakaguchi et al., 2021], following a similar evaluation as Zhang et al. [2025]. We implemented and evaluated NBL on an NVIDIA A100 GPU (80GB) using PyTorch [Paszke et al., 2019] and Hugging Face Transformers [Wolf, 2019]. Evaluation is carried out with the default parameters from the Evaluation Harness framework [Gao et al., 2024]. We compare NBL with several baseline methods, including SLEB [Song et al., 2024], SliceGPT [Ashkboos et al., 2024], PruneNet [Sengupta et al., 2025] and DROP [He et al., 2024], evaluating their performance on reasoning tasks and their improvements in latency and throughput. In the calibration of all methods, we used 256 samples from the C4 dataset [Raffel et al., 2020]. |
| Dataset Splits | Yes | Evaluation benchmarks include 5-shot performance on the MMLU task [Hendrycks et al., 2020] and 0-shot performance on ARC-easy (ARC-e), ARC-challenge (ARC-c) [Clark et al., 2018], BoolQ [Clark et al., 2019], HellaSwag [Zellers et al., 2019], OBQA [Mihaylov et al., 2018], PIQA [Bisk et al., 2020] and WinoGrande [Sakaguchi et al., 2021], following a similar evaluation as Zhang et al. [2025]. In the calibration of all methods, we used 256 samples from the C4 dataset [Raffel et al., 2020]. To fit the model in an NVIDIA A100 (80GB), we apply 4-bit post-training quantization using Activation-aware Weight Quantization (AWQ) [Lin et al., 2024] with default settings and 128 Pile samples [Gao et al., 2020] for calibration. Training is performed in bfloat16 for 3 epochs with a learning rate of 1e-4, an effective batch size of 16, and context length of 1024 tokens using a 5000-sample subset of the C4 validation split under a causal language modeling objective. |
| Hardware Specification | Yes | We implemented and evaluated NBL on an NVIDIA A100 GPU (80GB) using PyTorch [Paszke et al., 2019] and Hugging Face Transformers [Wolf, 2019]. To fit the model in an NVIDIA A100 (80GB), we apply 4-bit post-training quantization using Activation-aware Weight Quantization (AWQ) [Lin et al., 2024] with default settings and 128 Pile samples [Gao et al., 2020] for calibration. In this particular experiment setting to generate the figure, we used 2 NVIDIA A100 (80GB) GPUs, and a batch size of 16. |
| Software Dependencies | No | We implemented and evaluated NBL on an NVIDIA A100 GPU (80GB) using PyTorch [Paszke et al., 2019] and Hugging Face Transformers [Wolf, 2019]. In our implementation, compressing the Llama-3.1-8B and Mistral-7B models each containing 32 attention layers takes under 30 minutes using an NVIDIA A100 GPU (80 GB) for model inference and activation extraction, computing CCA bounds and NBL weights and biases. This runtime includes all necessary steps, including covariance estimation, SVD, and linear parameter computation. Appendix D.3 Implementation details: The algorithm is implemented using PyTorch [Paszke et al., 2019] and SciPy [Virtanen et al., 2020] for tensor operations, linear algebra routines, and eigen-decomposition, and Hugging Face Transformers [Wolf, 2019] for loading and managing the pretrained models. |
| Experiment Setup | Yes | Evaluation is carried out with the default parameters from the Evaluation Harness framework [Gao et al., 2024]. In the calibration of all methods, we used 256 samples from the C4 dataset [Raffel et al., 2020]. Prompt processing builds the key-value (KV) cache for a 2048-token input, while token generation autoregressively produces 2048 tokens with a batch size of 1, following He et al. [2024]. To fit the model in an NVIDIA A100 (80GB), we apply 4-bit post-training quantization using Activation-aware Weight Quantization (AWQ) [Lin et al., 2024] with default settings and 128 Pile samples [Gao et al., 2020] for calibration. We then fine-tune these linearized layers using LoRA with a rank of 32, α = 64, and a dropout of 0.1. Training is performed in bfloat16 for 3 epochs with a learning rate of 1e-4, an effective batch size of 16, and context length of 1024 tokens using a 5000-sample subset of the C4 validation split under a causal language modeling objective. |
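
For readers who want the fine-tuning settings quoted in the Experiment Setup row in one place, the following is a minimal sketch that collects them in a plain Python dataclass. The class name `FineTuneSetup` and its field names are hypothetical and do not come from the paper or its repository. Reading the quoted learning rate as 1e-4 is an assumption, and so is the step count, which assumes the final partial batch of each epoch is kept. The derived LoRA scaling uses the standard α / rank convention.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSetup:
    """Hypothetical container for the quoted fine-tuning settings."""

    lora_rank: int = 32              # "rank of 32"
    lora_alpha: int = 64             # "α = 64"
    lora_dropout: float = 0.1        # "dropout of 0.1"
    epochs: int = 3                  # "3 epochs"
    learning_rate: float = 1e-4      # assumption: quoted value read as 1e-4
    effective_batch_size: int = 16   # "effective batch size of 16"
    context_length: int = 1024       # "context length of 1024 tokens"
    num_train_samples: int = 5000    # "5000-sample subset" of C4 validation

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA convention: the low-rank update is scaled by alpha / rank.
        return self.lora_alpha / self.lora_rank

    @property
    def steps_per_epoch(self) -> int:
        # Assumption: the last, partially filled batch is not dropped.
        return math.ceil(self.num_train_samples / self.effective_batch_size)


setup = FineTuneSetup()
print(f"LoRA scaling factor: {setup.lora_scaling}")
print(f"Optimizer steps per epoch: {setup.steps_per_epoch}")
```

Keeping the values in a frozen dataclass makes the configuration hashable and hard to mutate by accident, which can help when comparing a rerun against the reported setup.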