Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

zip2zip: Inference-Time Adaptive Tokenization via Online Compression

Authors: Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate the effectiveness of zip2zip, we perform continued pretraining on the Phi-3 models (3B and 14B) within the zip2zip framework. We train a single model on a general-purpose corpus and evaluate it across four dimensions: (1) token efficiency, (2) language modeling perplexity, (3) downstream task performance, and (4) inference efficiency.
Researcher Affiliation	Academia	¹EPFL, ²Northeastern University, ⁴Université Grenoble Alpes, CNRS, Grenoble INP, LIG
Pseudocode	No	The paper describes the LZW algorithm and its integration into the architecture but does not provide a structured pseudocode block or algorithm section for the method.
Open Source Code	Yes	Code and models are released at https://github.com/epfl-dlab/zip2zip.
Open Datasets	Yes	We evaluate the perplexity of zip2zip models on four corpora: Wikitext [Merity et al., 2016], The Pile [Gao et al., 2020], and two subsets of Paloma [Magnusson et al., 2023]: mC4, a multilingual subset of C4, and dC4 (aka C4-100D)... To support effective fine-tuning, we construct a curated dataset with balanced representation across diverse domains, including code, mathematics, dialogue, general web content, and multilingual text. The final dataset contains approximately 1 billion compressed tokens. Table 14 summarizes the constituent datasets and their respective proportions. HuggingFaceFW/fineweb-edu [Lozhkov et al., 2024a]... devngho/the-stack-llm-annotations-v2 [Lozhkov et al., 2024b]... AI-MO/NuminaMath-1.5 [LI et al., 2024]... HuggingFaceH4/ultrachat_200k [Ding et al., 2023]... HuggingFaceFW/fineweb-2 [Penedo et al., 2024].
Dataset Splits	No	To support effective fine-tuning, we construct a curated dataset with balanced representation across diverse domains, including code, mathematics, dialogue, general web content, and multilingual text. The final dataset contains approximately 1 billion compressed tokens. Table 14 summarizes the constituent datasets and their respective proportions... Validation Interval: Every 100 steps. While the paper describes the composition of its training data and mentions validation intervals, it does not explicitly detail the train/test/validation splits for its custom 1B token pretraining dataset or how standard benchmark datasets were explicitly split for evaluation.
Hardware Specification	Yes	Hardware: 4 NVIDIA A100-SXM4-80GB GPUs, 64-core CPU (128 threads)... All training was conducted on internal servers equipped with NVIDIA H100 GPUs... Hardware: Apple M1 (16GB RAM)... NVIDIA H100 80GB GPU
Software Dependencies	Yes	Key Libraries: PyTorch >= 2.5.0, Transformers >= 4.47.0, Datasets <= 3.1.0, Accelerate >= 0.26.0
Experiment Setup	Yes	Pretrained Model: microsoft/Phi-3-medium-4k-instruct; Sequence Length: 1024; Total Batch Size: 32,768 tokens; Learning Rate Schedule: Cosine decay; Learning Rate Range: Max = 3e-4, Min = 1e-5; LoRA rank and alpha value: Both are 32; Training Steps: 10,000; Validation Interval: Every 100 steps; Checkpoint Interval: Every 500 steps; PyTorch Model Compilation: Enabled... Rank: 16; Alpha: 16; Target Modules: qkv_proj, o_proj, gate_proj, down_proj, up_proj... The loss weighting coefficient λ was chosen to be 0.1... We set the maximum merge size to M = 3 and use a two-layer transformer encoder as the hyper-encoder.
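
The learning-rate and batch figures in the Experiment Setup row can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' training code: the function name cosine_lr is hypothetical, and it assumes a plain cosine decay with no warmup, since the row does not say whether warmup was used. The numeric constants are the ones quoted in the row for the Phi-3-medium configuration.

```python
import math

# Values quoted in the Experiment Setup row.
MAX_LR = 3e-4
MIN_LR = 1e-5
TOTAL_STEPS = 10_000
SEQ_LEN = 1024
BATCH_TOKENS = 32_768


def cosine_lr(step, total_steps=TOTAL_STEPS, max_lr=MAX_LR, min_lr=MIN_LR):
    """Cosine decay from max_lr (step 0) to min_lr (final step).

    Assumption: no warmup phase; the source does not specify one.
    """
    # Clamp the step so out-of-range values stay on the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Derived quantities implied by the listed values.
sequences_per_batch = BATCH_TOKENS // SEQ_LEN   # sequences per optimizer step
total_tokens = BATCH_TOKENS * TOTAL_STEPS       # tokens seen over the whole run

if __name__ == "__main__":
    for step in (0, 2_500, 5_000, 7_500, 10_000):
        print(f"step {step:>6}: lr = {cosine_lr(step):.6e}")
    print("sequences per batch:", sequences_per_batch)
    print("total training tokens:", total_tokens)
```

Under these assumptions each optimizer step covers 32 sequences of 1024 tokens, and the 10,000-step run covers 327,680,000 tokens in total.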
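
The Pseudocode row notes that the paper describes LZW without giving a structured algorithm block, and the Experiment Setup row quotes a maximum merge size of M = 3. As an illustration only, here is a sketch of textbook LZW applied to a sequence of token ids with a cap on merge length. It is not the zip2zip implementation: the function names, the id-assignment scheme (new ids start at base_vocab_size), and the decision to stop registering entries longer than max_merge_size are all assumptions made for this example.

```python
def lzw_compress(token_ids, base_vocab_size, max_merge_size=3):
    """Textbook LZW over token ids, capped at max_merge_size base tokens per entry.

    Returns (compressed_ids, codebook), where codebook maps a tuple of base
    token ids to a newly assigned id >= base_vocab_size (hypothetical scheme).
    """
    codebook = {}
    next_id = base_vocab_size
    output = []
    current = ()  # longest match so far: a single token or a codebook entry

    def emit(seq):
        # Single base tokens are emitted as-is; longer matches use their new id.
        output.append(seq[0] if len(seq) == 1 else codebook[seq])

    for tok in token_ids:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in codebook:
            current = candidate  # keep extending the current match
            continue
        emit(current)
        if len(candidate) <= max_merge_size:
            codebook[candidate] = next_id  # register the new, longer sequence
            next_id += 1
        current = (tok,)
    if current:
        emit(current)
    return output, codebook


def lzw_expand(compressed_ids, codebook):
    """Expand compressed ids back to base tokens using the returned codebook."""
    inverse = {new_id: seq for seq, new_id in codebook.items()}
    expanded = []
    for cid in compressed_ids:
        expanded.extend(inverse.get(cid, (cid,)))
    return expanded


if __name__ == "__main__":
    tokens = [1, 2, 1, 2, 1, 2, 1, 2]
    compressed, book = lzw_compress(tokens, base_vocab_size=10)
    print(compressed)                      # [1, 2, 10, 12, 2]
    print(lzw_expand(compressed, book))    # [1, 2, 1, 2, 1, 2, 1, 2]
```

For simplicity, lzw_expand reuses the codebook returned by the compressor. A standard LZW decoder would rebuild the codebook online from the compressed stream instead.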