Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Partition Cover Approach to Tokenization

Authors: Jia Peng Lim, Shawn Tan, XianJun Davin Choo, Hady Lauw

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through empirical evaluations on real-world corpora, we show that GREEDTOK outperforms BPE and UNIGRAM on compression and achieves a covering score comparable to GREEDWMC. Finally, our extensive pre-training for two transformer-based language models with 1 billion parameters, comparing the choices of BPE and GREEDTOK as the tokenizer, shows that GREEDTOK achieves a lower bit per byte even when we control for either the total dataset proportion or total training tokens.
Researcher Affiliation	Collaboration	Jia Peng Lim, Singapore Management University, EMAIL; Shawn Tan, MIT-IBM Watson AI Lab, EMAIL; Davin Choo, Harvard University, EMAIL; Hady W. Lauw, Singapore Management University, EMAIL
Pseudocode	Yes	G Additional Pseudocode. Algorithm 1 GREEDTOK: Computing S. Algorithm 2 CANCOVER: Check if W_{i,i+|T|-1} is coverable by T in current state M. Algorithm 3 SCORE: Calculate total number of possible covers. Algorithm 4 GREEDTOK: Tokenizing a given text W using S.
Open Source Code	Yes	Our implementation and compression evaluations can be found at https://github.com/PreferredAI/pcatt/; see supplementary materials. We release an open-source code repository.
Open Datasets	Yes	Evaluation on four real-world corpora: UN, arχiv, wiki, PubMed. United Nations General Debate Corpus (UN) [JBD17]... This corpus has a Creative Commons (CC) 0: Public Domain License. arχiv... This corpus has a CC0: Public Domain License. Wikipedia-English (wiki)... We extract [Att15] the text from the database dump... PubMed Central Open Access (PubMed)... RefinedWeb. This corpus [PMH+23]... The corpus has an ODC-By 1.0 license. DCLM full-dedup. This corpus is built from DCLM [LFS+24]... This corpus is licensed under CC-by-4.
Dataset Splits	Yes	Our pre-training corpus is DCLM full-deduped dataset... trained on approximately 20% of the DCLM Dedup dataset, randomly selected... training on the dataset in two phases... sampling the first 500M documents for phase 1, and the next 100M documents for phase 2, which always use 20% of the training iterations of phase 1.
Hardware Specification	Yes	conducted with AMD EPYC 9654 @ 2.40GHz. We run our experiments using NVIDIA H100 80GB HBM3 cluster, with 96 logical CPU count, training at a rate of 400B tokens/day.
Software Dependencies	No	Our implementation of GREEDTOK is on C++ and accessible using Python bindings or through Hugging Face's API via a simple import line, enabling easy integration onto existing codebases. For model training, we use the Dolomite Engine [Mis24]. No specific version numbers for these software components are provided, which is required for a reproducible description.
Experiment Setup	Yes	Both models use a vocabulary size of 65,536 and are trained on approximately 20% of the DCLM Dedup dataset... Our model architecture is a 40-layer Transformer [VSP+17], with embedding size 1536, MLP using SwiGLU activation [Sha20] with intermediate size of 4096, and GQA [ALTdJ+23] layers with 12 query heads and 4 pairs of KV-heads. We used a fixed context length of 4096 tokens and a batch size of 2^22 ≈ 4M tokens... During training, we follow the same learning rate schedule as [SSM+24]... In the BPEM and GTET settings, we train the model for 125,000 and 25,000 iterations in phases 1 and 2 respectively. For GTEP, we take the model checkpoint of GTET at the 100,000th training iteration step, followed by an additional 20,000 training iterations in phase 2.
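
The figures quoted in the Experiment Setup and Hardware Specification rows can be cross-checked with a little arithmetic. The sketch below is only an illustration, not part of the paper or its repository: the names `SCHEDULES` and `token_budget` are hypothetical, and it assumes that every iteration processes a full batch of 2^22 tokens and that the reported rate of 400B tokens/day is sustained throughout training.

```python
# Figures quoted in the table rows above.
CONTEXT_LENGTH = 4096      # tokens per sequence
BATCH_TOKENS = 2 ** 22     # tokens per iteration (about 4M)
TOKENS_PER_DAY = 400e9     # reported training rate

# Iterations per setting as (phase 1, phase 2).
# For GTEP, phase 1 is the GTET checkpoint taken at step 100,000.
SCHEDULES = {
    "BPEM": (125_000, 25_000),
    "GTET": (125_000, 25_000),
    "GTEP": (100_000, 20_000),
}


def token_budget(phase1_iters, phase2_iters):
    """Derive batch shape, phase ratio, and token totals for one setting.

    Assumes every iteration consumes exactly BATCH_TOKENS tokens.
    """
    total_iters = phase1_iters + phase2_iters
    total_tokens = total_iters * BATCH_TOKENS
    return {
        "sequences_per_batch": BATCH_TOKENS // CONTEXT_LENGTH,
        "phase2_ratio": phase2_iters / phase1_iters,
        "total_iterations": total_iters,
        "total_tokens": total_tokens,
        "days_at_reported_rate": total_tokens / TOKENS_PER_DAY,
    }


budgets = {name: token_budget(*iters) for name, iters in SCHEDULES.items()}

for name, b in budgets.items():
    print(
        f"{name}: {b['total_iterations']:,} iterations, "
        f"{b['total_tokens'] / 1e9:.1f}B tokens, "
        f"phase 2 / phase 1 = {b['phase2_ratio']:.0%}, "
        f"{b['days_at_reported_rate']:.2f} days at the reported rate"
    )
```

Under these assumptions, phase 2 comes out at 20% of the phase 1 iterations in every setting, which matches the Dataset Splits row, and GTEP sees fewer total training tokens than BPEM and GTET because it stops phase 1 earlier.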