Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

Authors: Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83× (8B) and 8.93× (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization. 4 Experiments. Setup: We evaluate CodeGEMM by exploring the trade-offs across key hyperparameters, focusing on three primary metrics relevant to LLM compression: memory footprint, latency, and accuracy.
Researcher Affiliation | Industry | Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee. NAVER Cloud.
Pseudocode | No | The paper includes Figure 3, which is an 'Overview of the CodeGEMM kernel operation' depicting steps with diagrams and text descriptions (e.g., 'Step ❶: Input reshape', 'Step ❷: Build Psumbook', 'Step ❸: Data retrieval & Accumulation'). However, it does not contain a formally structured pseudocode block or algorithm section with numbered textual steps. The methodology is described in paragraph form.
Open Source Code | Yes | github.com/naver-aics/codegemm
Open Datasets | Yes | On Llama-3 models, CodeGEMM delivers 1.83× (8B) and 8.93× (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization. [...] Accuracy is evaluated using the lm-eval-harness [7] benchmark suite across both zero-shot and 5-shot settings on standard tasks. [...] Perplexity, measured using the WikiText-2 dataset, is used to evaluate accuracy.
Dataset Splits | No | The paper mentions that "Accuracy is evaluated using the lm-eval-harness [7] benchmark suite across both zero-shot and 5-shot settings on standard tasks." While this indicates the use of established benchmarks and evaluation protocols, it does not explicitly provide the training, validation, or test splits (e.g., percentages or sample counts) for the datasets used within the text of the paper.
Hardware Specification | Yes | All latency measurements are performed on an NVIDIA A100 80GB GPU.
Software Dependencies | No | Throughput (or, equivalently, end-to-end latency) is additionally measured using the Llama implementation provided by the Hugging Face [29] library with layer fusion. Although this library is not optimized for high-throughput inference, it remains one of the most widely used frameworks and thus serves as a practical baseline. [...] We measured DRAM traffic proxies and power efficiency using nvidia-smi [18] telemetry. While specific software libraries are mentioned (Hugging Face, nvidia-smi), no specific version numbers for these dependencies are provided in the text.
Experiment Setup | Yes | On Llama-3 models, CodeGEMM delivers 1.83× (8B) and 8.93× (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization. [...] Specifically, latency is reported as the sum of kernel execution times for all linear layers in a single Transformer decoder block without layer fusion. [...] We measured DRAM traffic proxies and power efficiency using nvidia-smi [18] telemetry. Metrics were sampled at a 100 ms cadence over a 10 s window and averaged across trials. [...] Table 3: Kernel-level Performance evaluation on a GEMV with (M, N, K) = (1, 28672, 8192). [...] Appendix A.2 Tile Size Sensitivity: We revisited our heuristic choices for the tile dimensions and conducted a systematic sweep over tw ∈ {32, 64, 128} and th ∈ {2048, 4096} across representative shapes. [...] Appendix A.3 Effect of Higher Bit Precision: We additionally measured latency for higher average bit precisions using the kernel configuration (g=128, b=8, tw=32, th=2048). [...] Appendix A.4 Batch-Size Sensitivity: Table 9 reports linear latency on Llama-3-8B as a function of batch size (BS).
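
The Psumbook idea quoted in the Research Type row (precompute inner products between centroids and activation segments, then let code indices gather those partial sums) can be illustrated with a minimal pure-Python sketch. This is not the paper's CUDA kernel: it assumes a single shared codebook, no group-wise scales, and no tiling, and the names `build_psumbook`, `psumbook_gemv`, and `dequant_gemv` as well as the sizes `V`, `NUM_CENTROIDS`, `K`, and `N` are hypothetical choices made for illustration only.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def build_psumbook(codebook, x, v):
    """psumbook[j][c] = <codebook[c], x[j*v:(j+1)*v]> for every segment j."""
    segments = [x[i:i + v] for i in range(0, len(x), v)]
    return [[dot(centroid, seg) for centroid in codebook] for seg in segments]


def psumbook_gemv(codes, psumbook):
    """Each output row gathers one precomputed partial sum per segment."""
    return [sum(psumbook[j][c] for j, c in enumerate(row)) for row in codes]


def dequant_gemv(codes, codebook, x):
    """Reference path: rebuild each weight row from its codes, then multiply."""
    out = []
    for row in codes:
        w_row = [val for c in row for val in codebook[c]]
        out.append(dot(w_row, x))
    return out


# Hypothetical toy sizes (assumptions, not the paper's configuration).
random.seed(0)
V, NUM_CENTROIDS, K, N = 4, 16, 32, 8
codebook = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(NUM_CENTROIDS)]
codes = [[random.randrange(NUM_CENTROIDS) for _ in range(K // V)] for _ in range(N)]
x = [random.uniform(-1, 1) for _ in range(K)]

psumbook = build_psumbook(codebook, x, V)
y_psum = psumbook_gemv(codes, psumbook)
y_ref = dequant_gemv(codes, codebook, x)

# Both paths should agree up to floating-point rounding.
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(y_psum, y_ref))
```

In this sketch the Psumbook costs (K / V) × NUM_CENTROIDS short dot products once per activation vector, after which every output row needs only K / V table lookups and additions rather than K multiply-adds on reconstructed weights.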