Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A Partition Cover Approach to Tokenization
Authors: Jia Peng Lim, Shawn Tan, XianJun Davin Choo, Hady Lauw
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical evaluations on real-world corpora, we show that GREEDTOK outperforms BPE and UNIGRAM on compression and achieves a covering score comparable to GREEDWMC. Finally, our extensive pre-training for two transformer-based language models with 1 billion parameters, comparing the choices of BPE and GREEDTOK as the tokenizer, shows that GREEDTOK achieves a lower bit per byte even when we control for either the total dataset proportion or total training tokens. |
| Researcher Affiliation | Collaboration | Jia Peng Lim (Singapore Management University); Shawn Tan (MIT-IBM Watson AI Lab); Davin Choo (Harvard University); Hady W. Lauw (Singapore Management University) |
| Pseudocode | Yes | G Additional Pseudocode. Algorithm 1 GREEDTOK: Computing S. Algorithm 2 CANCOVER: Check if $W_{i,\,i+\lvert T\rvert-1}$ is coverable by T in current state M. Algorithm 3 SCORE: Calculate total number of possible covers. Algorithm 4 GREEDTOK: Tokenizing a given text W using S. |
| Open Source Code | Yes | Our implementation and compression evaluations can be found at https://github.com/PreferredAI/pcatt/; see supplementary materials. We release an open-source code repository. |
| Open Datasets | Yes | Evaluation on four real-world corpora: UN, arXiv, wiki, PubMed. United Nations General Debate Corpus (UN) [JBD17]... This corpus has a Creative Commons (CC) 0: Public Domain License. arXiv... This corpus has a CC0: Public Domain License. Wikipedia-English (wiki)... We extract [Att15] the text from the database dump... PubMed Central Open Access (PubMed)... RefinedWeb. This corpus [PMH+23]... The corpus has a ODC-By 1.0 license. DCLM full-dedup. This corpus is built from DCLM [LFS+24]... This corpus is licensed under CC-by-4. |
| Dataset Splits | Yes | Our pre-training corpus is DCLM full-deduped dataset... trained on approximately 20% of the DCLM Dedup dataset, randomly selected... training on the dataset in two phases... sampling the first 500M documents for phase 1, and the next 100M documents for phase 2, which always use 20% of the training iterations of phase 1. |
| Hardware Specification | Yes | conducted with AMD EPYC 9654 @ 2.40GHz. We run our experiments using NVIDIA H100 80GB HBM3 cluster, with 96 logical CPU count, training at a rate of 400B tokens/day. |
| Software Dependencies | No | Our implementation of GREEDTOK is on C++ and accessible using Python bindings or through Hugging Face's API via a simple import line, enabling easy integration onto existing codebases. For model training, we use the Dolomite Engine [Mis24]. No specific version numbers for these software components are provided, which are required for a reproducible description. |
| Experiment Setup | Yes | Both models use a vocabulary size of 65,536 and are trained on approximately 20% of the DCLM Dedup dataset... Our model architecture is a 40-layer Transformer [VSP+17], with embedding size 1536, MLP using SwiGLU activation [Sha20] with intermediate size of 4096, and GQA [ALTdJ+23] layers with 12 query heads and 4 pairs of KV-heads. We used a fixed context length of 4096 tokens and a batch size of 2^22 ≈ 4M tokens... During training, we follow the same learning rate schedule as [SSM+24]... In the BPEM and GTET settings, we train the model for 125,000 and 25,000 iterations in phases 1 and 2 respectively. For GTEP, we take the model checkpoint of GTET at the 100,000th training iteration step, followed by an additional 20,000 training iterations in phase 2. |
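
The iteration counts and batch size quoted in the Experiment Setup row imply a token budget for each setting. The sketch below works through that arithmetic. It assumes every training iteration consumes exactly one full batch of 2^22 tokens, which the quoted text does not state explicitly. The names `SETTINGS`, `summarize`, and `SUMMARY` are hypothetical and used only for illustration.

```python
# Assumption (not stated in the source): one iteration = one batch of 2**22 tokens.
BATCH_TOKENS = 2**22      # batch size in tokens, as quoted
CONTEXT_LENGTH = 4096     # fixed context length in tokens, as quoted

# (phase 1 iterations, phase 2 iterations) for each setting, as quoted.
# GTEP resumes from the GTET checkpoint at iteration 100,000.
SETTINGS = {
    "BPEM": (125_000, 25_000),
    "GTET": (125_000, 25_000),
    "GTEP": (100_000, 20_000),
}


def summarize(phase1_iters, phase2_iters):
    """Derive batch shape and token totals from the quoted iteration counts."""
    total_iters = phase1_iters + phase2_iters
    return {
        "sequences_per_batch": BATCH_TOKENS // CONTEXT_LENGTH,
        "phase2_fraction": phase2_iters / phase1_iters,
        "total_iterations": total_iters,
        "total_tokens": total_iters * BATCH_TOKENS,
    }


SUMMARY = {name: summarize(*iters) for name, iters in SETTINGS.items()}

if __name__ == "__main__":
    for name, row in SUMMARY.items():
        print(name, row)
```

Under that assumption, BPEM and GTET each see roughly 629B tokens and GTEP roughly 503B. In all three settings phase 2 runs for 20% of the phase 1 iterations, consistent with the Dataset Splits row.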