Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Language Models over Canonical Byte-Pair Encodings

Authors: Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian Dusell, Benjamin Lebrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O’Donnell, Ryan Cotterell

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora. 5. Experiments. This section evaluates our proposed methods, canonicality by constraints (global and local; §3) and canonicality by conditioning (§4), by measuring their impact on real datasets and language models.
Researcher Affiliation	Academia	1 ETH Zürich, 2 Mila, 3 McGill University, 4 Canada CIFAR AI Chair. Correspondence to: Tim Vieira <EMAIL>.
Pseudocode	Yes	def rejection_sampling(): while True: δ ∼ sample(p); if δ ∈ D: return δ
Open Source Code	Yes	github.com/genlm/canonical-icml-2025
Open Datasets	Yes	Penn Treebank (PTB; Marcus et al., 1993) (test split; 3761 strings, 82k words, 439k characters); WikiText (Merity et al., 2017) (test split; 4358 strings, 234k words, and 1286k characters)
Dataset Splits	Yes	Penn Treebank (PTB; Marcus et al., 1993) (test split; 3761 strings, 82k words, 439k characters); WikiText (Merity et al., 2017) (test split; 4358 strings, 234k words, and 1286k characters). We fine-tuned two language models, GPT-2S and GPT-2M, on the PTB train set and a subset of the WikiText train set with 50K strings and 4.2M words.
Hardware Specification	No	No specific hardware details (like GPU/CPU models) are provided. The mention of 'bfloat16' refers to a data type used for model parameters, not a hardware specification.
Software Dependencies	No	The paper mentions the 'AdamW optimizer' but does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow).
Experiment Setup	Yes	We fine-tuned two language models, GPT-2S and GPT-2M, on the PTB train set and a subset of the WikiText train set with 50K strings and 4.2M words. We consider fine-tuning the canonicalized architecture (ℓθ) and the original architecture (pθ) using the training criterion Fλ for λ ∈ {0.001, 0.01, 0.1, 0.2}. Each model is trained for 3 epochs using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 5e-5 and linear learning rate decay. For efficiency, we use bfloat16 to represent the model parameters. We use a minibatch of size 8 for estimating the gradient of each term of the Fλ objective.
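
The rejection-sampling pseudocode quoted in the Pseudocode row can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the toy vocabulary `VOCAB`, the helper `make_sampler`, and the `is_canonical` rule are hypothetical stand-ins for the language model p and the set D of canonical token sequences.

```python
import random

# Hypothetical toy vocabulary with a single BPE-style merge: "a" + "b" -> "ab".
VOCAB = ["a", "b", "ab"]


def is_canonical(tokens):
    """Toy membership test for D (an assumption for this sketch).

    With the single merge (a, b) -> ab, a tokenizer would never emit
    the token "a" immediately followed by the token "b".
    """
    return all(not (x == "a" and y == "b") for x, y in zip(tokens, tokens[1:]))


def make_sampler(rng, max_len=4):
    """Return a stand-in for sample(p): uniform random token sequences."""
    def sample():
        n = rng.randint(1, max_len)
        return tuple(rng.choice(VOCAB) for _ in range(n))
    return sample


def rejection_sampling(sample, accept):
    """Draw from `sample` repeatedly and return the first accepted draw."""
    while True:
        delta = sample()
        if accept(delta):
            return delta


if __name__ == "__main__":
    rng = random.Random(0)
    print(rejection_sampling(make_sampler(rng), is_canonical))
```

As in the quoted pseudocode, the loop is unbounded, so the expected number of draws grows as the acceptance probability shrinks; in this toy setting most draws are accepted.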