Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
Authors: Benjamin Minixhofer, Ivan Vulić, Edoardo Maria Ponti
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify the efficacy of our method on three distinct use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers, including rapid transfer of subword models to the byte-level. ... Secondly, we distil a large maths-specialised LLM into a small general-purpose model with a different tokenizer, achieving competitive maths problem-solving performance. Thirdly, we use our method to train state-of-the-art embedding prediction hypernetworks for training-free tokenizer transfer. |
| Researcher Affiliation | Academia | Benjamin Minixhofer (C), Ivan Vulić (C), Edoardo M. Ponti (E, C); (C) University of Cambridge, (E) University of Edinburgh |
| Pseudocode | No | The paper includes a figure (Figure 1) that sketches the method but does not provide structured pseudocode or algorithm blocks. It refers to 'alignment algorithm' in Appendix I but doesn't present the algorithm steps in a pseudocode format. |
| Open Source Code | Yes | Our code and models are available at github.com/bminixhofer/tokenkit. |
| Open Datasets | Yes | We train on the Tulu3 instruction-tuning dataset (Lambert et al., 2025) with LoRA (Hu et al., 2022). ... We evaluate on a standard set of natural language benchmarks consisting of PiQA (Bisk et al., 2020), ARC-Challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), MMLU (Hendrycks et al., 2021), AGIEval (Zhong et al., 2023) and IFEval (Zhou et al., 2023). ... We use the OpenMathInstruct-2 dataset (which the teacher has been trained on; Toshniwal et al., 2024)... We report zero-shot accuracy on GSM8K (Cobbe et al., 2021) and the MATH benchmark (Hendrycks et al., 2021). |
| Dataset Splits | No | The paper mentions using specific datasets for training and evaluation benchmarks. For example, 'We train on the Tulu3 instruction-tuning dataset' and 'We evaluate on a standard set of natural language benchmarks'. While standard benchmarks often have predefined splits, the paper does not explicitly state the dataset splits (e.g., percentages or sample counts for training, validation, and test sets) for reproduction. |
| Hardware Specification | Yes | We conduct all experiments on a cluster of 40 v3 TPU chips and 64 v4 TPU chips. The largest individual experiments run on a pod of 32 v4 TPU chips and take 24 hours for transfer of Llama3 to byte-level tokenization, 12 hours for transfer of Gemma2 to byte-level tokenization, 2 days for Gemma2 hypernetwork training and 5 to 10 hours individually per remaining experiment. |
| Software Dependencies | No | We use Adam (Kingma & Ba, 2015) without weight decay following the adaptations settings of Groeneveld et al. (2024). ... We use LoRA (Hu et al., 2022) with α = r = 64. ... We use lm-eval (Gao et al., 2024) for all evaluations. The paper mentions various tools and methods but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Across experiments, unless specified otherwise, we use a batch size of 64 texts and a sequence length of 512 tokens for both student and teacher. We use Adam (Kingma & Ba, 2015) without weight decay following the adaptations settings of Groeneveld et al. (2024). We choose a default peak learning rate of 1e-5 based on the findings in Appendix A.1, training for 20k steps with linear warmup over 2k steps, then linear decay to zero. We use GradMag (c.f. Section 3.3) to balance loss components. We use LoRA (Hu et al., 2022) with α = r = 64. For ALM, we use a threshold γ = 0.1, the temperature τ = 100, and set f to $f_{\mathrm{KL}}(p_T \parallel p_S) = p_T \log \frac{p_T}{p_S}$ to recover the KL-divergence, analysing the impact of these choices in Appendix A.2. ... We quadruple the student sequence length to 2048, halve the batch size to 32 texts, increase the learning rate to 3e-5 and train the full model. |
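
The learning-rate schedule quoted in the Experiment Setup row (peak 1e-5, 20k steps, linear warmup over 2k steps, then linear decay to zero) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `learning_rate` is hypothetical, and it assumes that warmup starts from zero and that the 2k warmup steps count toward the 20k total, neither of which the quoted text states.

```python
def learning_rate(step, peak_lr=1e-5, total_steps=20_000, warmup_steps=2_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    Assumptions (not confirmed by the quoted setup): warmup starts at 0.0,
    and warmup steps are included in total_steps.
    """
    if not 0 <= step <= total_steps:
        raise ValueError(f"step must be in [0, {total_steps}], got {step}")
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Print the rate at a few representative steps.
    for s in (0, 1_000, 2_000, 11_000, 20_000):
        print(s, learning_rate(s))
```

For the full-model setting mentioned at the end of the row, the same sketch would be called with `peak_lr=3e-5`; whether the step counts stay the same in that setting is not stated in the quoted text.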