Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Understanding Differential Transformer Unchains Pretrained Self-Attentions

Authors: Chaerin Kong, Jiho Jang, Nojun Kwak

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations confirm that DEX substantially improves the pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01%).
Researcher Affiliation | Collaboration | Chaerin Kong (1, 2), Jiho Jang (2), Nojun Kwak (2). Affiliations: (1) Twelve Labs, (2) Seoul National University.
Pseudocode | Yes | Figure 9: Differential Extension (DEX). The output value matrix O is transformed by subtracting a λ-modulated projection from itself. This operation targets a layer-specific subset of attention heads.

    def Attn(X, W_q, W_k, W_v, f_D, λ, do):
        # standard softmax attention
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        s = 1 / sqrt(d)
        A = Q @ K.transpose(-1, -2) * s
        O = softmax(A) @ V
        # implicit differential adaptation
        O = O - λ * f_D(O) if do else O
        return O

    def MHA(X, W_q, W_k, W_v, f_D, W_o, λ, hs):
        # hs: selected heads
        O = [Attn(X, ..., λ, do=(i in hs)) for i in range(h)]
        return Concat(O) @ W_o
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Some part of the code is proprietary asset, which prohibits disclosure.
Open Datasets | Yes | We constructed our custom training corpus using a subset of the Dolmino dataset (footnote 5). Specifically, we mixed web pages, academic papers, encyclopedia entries, and code texts in approximate ratios of 74.3%, 6.5%, 7.9%, and 11.3% respectively. This resulted in a corpus totaling 887M tokens (measured using the Llama-3 tokenizer). Our data preparation generally followed the recipe of OLMo2 [62]... Footnote 5: https://huggingface.co/datasets/allenai/dolmino-mix-1124
Dataset Splits | No | We constructed our custom training corpus using a subset of the Dolmino dataset (footnote 5). Specifically, we mixed web pages, academic papers, encyclopedia entries, and code texts in approximate ratios of 74.3%, 6.5%, 7.9%, and 11.3% respectively. This resulted in a corpus totaling 887M tokens (measured using the Llama-3 tokenizer). Our data preparation generally followed the recipe of OLMo2 [62]... All models, including baselines and DEX variants, were trained on our custom corpus for 1 epoch. A context length of 32k tokens was used for all Llama and Qwen models during this training phase. We report performances on 11 widely used language modeling benchmarks [16, 78, 76, 86, 58, 8, 68] using [25].
Hardware Specification | Yes | All experiments were conducted using 8 NVIDIA A100-80GB GPUs, with the run time ranging from 2.5 to 16 hours depending on the model size.
Software Dependencies | No | All tests were conducted on a single NVIDIA A100-80GB GPU, utilizing PyTorch's standard scaled dot-product attention implementation (footnote 4). The reported throughputs are averaged over 30 batches, following an initial 5 warm-up batches.
Experiment Setup | Yes | Training: All models, including baselines and DEX variants, were trained on our custom corpus for 1 epoch. A context length of 32k tokens was used for all Llama and Qwen models during this training phase. We employed a cosine learning rate schedule, using a peak learning rate of 1 × 10^-4 for partial fine-tuning methods (including DEX) and 1 × 10^-5 for full fine-tuning baselines, as these settings generally yielded the best outcomes in preliminary experiments. A learning rate warm-up ratio of 0.03 was used.
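
The pseudocode quoted in the Pseudocode row can be made concrete with a small standard-library sketch. It is an illustration only, not the authors' implementation (which is not released). Several details are assumptions: f_D is modeled as a single linear map with a hypothetical weight matrix `W_d`, λ is a plain scalar `lam`, matrices are nested Python lists, and no causal mask is applied because the quoted figure shows none.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(A):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attn(X, W_q, W_k, W_v, W_d, lam, do):
    """One attention head; if `do` is true, subtract lam * f_D(O) from O."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    scale = 1.0 / math.sqrt(len(Q[0]))          # s = 1 / sqrt(d)
    K_T = [list(col) for col in zip(*K)]
    A = [[a * scale for a in row] for row in matmul(Q, K_T)]
    O = matmul(softmax_rows(A), V)              # standard softmax attention
    if do:
        D = matmul(O, W_d)                      # f_D(O): ASSUMED to be linear
        O = [[o - lam * d for o, d in zip(o_row, d_row)]
             for o_row, d_row in zip(O, D)]
    return O

def mha(X, heads, W_o, lam, selected):
    """Multi-head attention; only head indices in `selected` are adapted."""
    outs = [attn(X, h["W_q"], h["W_k"], h["W_v"], h["W_d"], lam, do=(i in selected))
            for i, h in enumerate(heads)]
    # Concatenate head outputs along the feature axis, token by token.
    concat = [sum((o[t] for o in outs), []) for t in range(len(X))]
    return matmul(concat, W_o)
```

With `lam = 0` every head reduces to ordinary softmax attention, so under these assumptions the adapted layer starts from the same outputs as the unmodified one.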
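
The throughput protocol quoted in the Software Dependencies row (5 warm-up batches, then 30 measured batches) might look roughly like the sketch below. The function name `measure_throughput`, the `run_batch` callable, and the `tokens_per_batch` parameter are hypothetical. The sketch also assumes that "averaged" means total tokens divided by total time over the measured batches; the quoted text does not say which averaging was used.

```python
import time

def measure_throughput(run_batch, tokens_per_batch, warmup=5, timed=30):
    """Return tokens per second over `timed` batches after `warmup` untimed ones."""
    # Warm-up batches are executed but excluded from the measurement.
    for _ in range(warmup):
        run_batch()

    # On a GPU, the device would need to be synchronized before each clock
    # read (for example with torch.cuda.synchronize()); omitted here so the
    # sketch stays standard-library only.
    start = time.perf_counter()
    for _ in range(timed):
        run_batch()
    elapsed = time.perf_counter() - start

    return timed * tokens_per_batch / elapsed
```

Timing the whole measured run at once, rather than each batch separately, keeps per-call timer overhead out of the result.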
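
The learning rate settings in the Experiment Setup row can be illustrated with a short sketch of a cosine schedule with warm-up. The quoted text gives only the peak rates and the warm-up ratio of 0.03, so the linear warm-up shape, the decay floor `min_lr = 0`, and the function name `lr_at` are assumptions, not details confirmed by the paper.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.03, min_lr=0.0):
    """Learning rate at `step` for linear warm-up followed by cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # ASSUMED linear warm-up that reaches peak_lr on the last warm-up step.
        return peak_lr * ((step + 1) / warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at(step, total_steps, peak_lr=1e-4)` would correspond to the partial fine-tuning setting and `peak_lr=1e-5` to the full fine-tuning baselines, with `total_steps` set by the corpus size and batch size, neither of which is fixed by the quoted passage.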