Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Understanding Differential Transformer Unchains Pretrained Self-Attentions
Authors: Chaerin Kong, Jiho Jang, Nojun Kwak
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations confirm that DEX substantially improves the pretrained LLMs across diverse benchmarks, achieving significant performance gains with minimal adaptation data (< 0.01%). |
| Researcher Affiliation | Collaboration | Chaerin Kong¹,², Jiho Jang², Nojun Kwak². ¹Twelve Labs, ²Seoul National University. EMAIL |
| Pseudocode | Yes | Figure 9: Differential Extension (DEX). The output value matrix O is transformed by subtracting a λ-modulated projection from itself. This operation targets a layer-specific subset of attention heads. def Attn(X, W_q, W_k, W_v, f_D, λ, do): # standard softmax attention Q, K, V = X @ W_q, X @ W_k, X @ W_v s = 1 / sqrt(d) A = Q @ K.transpose(-1, -2) * s O = softmax(A) @ V # implicit differential adaptation O = O - λ * f_D(O) if do else O return O def MHA(X, W_q, W_k, W_v, f_D, W_o, λ, hs): O = [Attn(X, ..., λ, do=(i in hs)) for i in range(h)] # hs: selected heads return Concat(O) @ W_o |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Some part of the code is proprietary asset, which prohibits disclosure. |
| Open Datasets | Yes | We constructed our custom training corpus using a subset of the Dolmino dataset⁵. Specifically, we mixed web pages, academic papers, encyclopedia entries, and code texts in approximate ratios of 74.3%, 6.5%, 7.9%, and 11.3% respectively. This resulted in a corpus totaling 887M tokens (measured using the Llama-3 tokenizer). Our data preparation generally followed the recipe of OLMo2 [62]... ⁵ https://huggingface.co/datasets/allenai/dolmino-mix-1124 |
| Dataset Splits | No | We constructed our custom training corpus using a subset of the Dolmino dataset⁵. Specifically, we mixed web pages, academic papers, encyclopedia entries, and code texts in approximate ratios of 74.3%, 6.5%, 7.9%, and 11.3% respectively. This resulted in a corpus totaling 887M tokens (measured using the Llama-3 tokenizer). Our data preparation generally followed the recipe of OLMo2 [62]... All models, including baselines and DEX variants, were trained on our custom corpus for 1 epoch. A context length of 32k tokens was used for all Llama and Qwen models during this training phase. We report performances on 11 widely used language modeling benchmarks [16, 78, 76, 86, 58, 8, 68] using [25]. |
| Hardware Specification | Yes | All experiments were conducted using 8 NVIDIA A100-80GB GPUs, with the run time ranging from 2.5-16 hours depending on the model size. |
| Software Dependencies | No | All tests were conducted on a single NVIDIA A100-80GB GPU, utilizing PyTorch's standard scaled dot-product attention implementation⁴. The reported throughputs are averaged over 30 batches, following an initial 5 warm-up batches. |
| Experiment Setup | Yes | Training: All models, including baselines and DEX variants, were trained on our custom corpus for 1 epoch. A context length of 32k tokens was used for all Llama and Qwen models during this training phase. We employed a cosine learning rate schedule, using a peak learning rate of 1 × 10⁻⁴ for partial fine-tuning methods (including DEX) and 1 × 10⁻⁵ for full fine-tuning baselines, as these settings generally yielded the best outcomes in preliminary experiments. A learning rate warm-up ratio of 0.03 was used. |
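
The single-head `Attn` pseudocode quoted in the Pseudocode row can be sketched as runnable, dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: the helpers `matmul` and `softmax_rows` are hypothetical, `f_d` is modeled as an arbitrary callable because the quoted excerpt does not specify its form, `d` is taken to be the head dimension, and no causal mask is applied because the quoted pseudocode shows none.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows (hypothetical helper)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(a):
    """Numerically stable softmax over each row (hypothetical helper)."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attn(x, w_q, w_k, w_v, f_d, lam, do):
    """One attention head; subtracts lam * f_d(O) from O when `do` is true."""
    # Standard softmax attention.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))  # assumption: d is the head dimension
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    o = matmul(softmax_rows(scores), v)
    # DEX step: only applied to the selected heads.
    if do:
        proj = f_d(o)
        o = [[a - lam * b for a, b in zip(row_o, row_p)]
             for row_o, row_p in zip(o, proj)]
    return o

# Toy usage with identity weights and an identity f_d (illustrative values only).
eye = [[1.0, 0.0], [0.0, 1.0]]
plain = attn(eye, eye, eye, eye, lambda o: o, 0.5, do=False)
adapted = attn(eye, eye, eye, eye, lambda o: o, 0.5, do=True)
```

With an identity `f_d`, the adapted output is simply `(1 - lam)` times the plain attention output, which makes the effect of the subtraction easy to check by hand.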
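
The learning rate settings quoted in the Experiment Setup row (cosine schedule, peak 1 × 10⁻⁴ for partial fine-tuning, warm-up ratio 0.03) can be illustrated with a short sketch. Several details are assumptions, since the quoted text does not state them: the warm-up is taken to be linear, the decay is taken to end at `min_lr = 0`, and the function name `lr_at` and the step count in the usage lines are hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03, min_lr=0.0):
    """Learning rate at `step`: linear warm-up (assumed), then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length; the report does not give the number of optimizer steps.
schedule = [lr_at(s, 1000) for s in range(1000)]
full_ft_schedule = [lr_at(s, 1000, peak_lr=1e-5) for s in range(1000)]
```

Passing `peak_lr=1e-5` gives the corresponding curve for the full fine-tuning baselines under the same assumptions.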