An Analysis of Tokenization: Transformers under Markov Data

Authors: Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When trained on data drawn from certain simple kth-order Markov processes for k > 1, transformers exhibit a surprising phenomenon: in the absence of tokenization, they empirically are incredibly slow or fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. (A minimal sketch of such a source, and of the unigram vs. optimal cross-entropy gap, appears after this table.)
Researcher Affiliation | Academia | Nived Rajaraman, UC Berkeley (nived.rajaraman@berkeley.edu); Jiantao Jiao, UC Berkeley (jiantao@berkeley.edu); Kannan Ramchandran, UC Berkeley.
Pseudocode | Yes | Algorithm 1: Sequential implementation of BPE. (A minimal BPE-style merge loop is sketched after this table.)
Open Source Code | Yes | NeurIPS checklist question: "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?" Answer: [Yes]. Justification: instructions are provided in the Jupyter notebook.
Open Datasets | Yes | The Wikitext-103-raw-v1 dataset (Merity et al., 2016) and the GLUE dataset (Wang et al., 2019).
Dataset Splits | No | The paper mentions "test loss" and "validation loss evaluations" but does not specify the exact percentages or counts for the training, validation, and test splits. It implies the use of a validation set but lacks explicit details on its proportion or how it was separated.
Hardware Specification | Yes | "we train the transformers on a single GPU on an 8 A100 node."
Software Dependencies | No | Table 3 lists "Optimizer: AdamW (β1 = 0.9, β2 = 0.95)" but does not provide specific version numbers for software dependencies such as PyTorch, Python, or CUDA.
Experiment Setup | Yes | Table 3 hyperparameters: batch size grid-searched in {8, 16, 32}; learning rate 0.002; cosine scheduler; 8000 iterations; weight decay 1e-3; dropout 0; sequence length 512; embedding dimension grid-searched in {10, 20, 30, 40}; number of layers grid-searched in {1, 2, 4, 8}; number of heads grid-searched in {1, 2, 4, 8, 16}. (These values are collected into a configuration sketch after this table.)
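
To make the Research Type row concrete, the following is a minimal, illustrative Python sketch that is not taken from the paper: it samples a binary order-k Markov source with an arbitrary random transition table and compares the per-character cross-entropy of a unigram (marginal-frequency) predictor with that of the true conditional model, which is the loss an optimal sequence model would approach. The order k = 2, the sequence length, and the transition probabilities are assumptions chosen for illustration only.

    """Illustrative sketch (not from the paper): sample a binary order-k Markov
    source and compare a unigram predictor's cross-entropy with the loss of the
    true conditional model, which an optimal sequence model would approach."""
    import numpy as np

    rng = np.random.default_rng(0)
    k = 2            # Markov order (assumed here; the paper considers k > 1)
    n = 200_000      # length of the sampled sequence (assumption)

    # For each of the 2^k contexts, P(next symbol = 1 | context), drawn arbitrarily.
    p_one = rng.uniform(0.1, 0.9, size=2 ** k)

    seq = np.zeros(n, dtype=np.int64)    # sampled symbols
    ctxs = np.zeros(n, dtype=np.int64)   # context (last k symbols, encoded as an int) before each symbol
    ctx = 0
    for t in range(n):
        ctxs[t] = ctx
        x = int(rng.random() < p_one[ctx])
        seq[t] = x
        ctx = ((ctx << 1) | x) & (2 ** k - 1)   # slide the k-symbol window

    # Unigram model: predict every symbol from the marginal frequency of 1s.
    q = seq.mean()
    unigram_loss = -(q * np.log(q) + (1 - q) * np.log(1 - q))

    # Optimal predictor: score every symbol with the true conditional probability.
    p = p_one[ctxs]
    optimal_loss = -np.mean(seq * np.log(p) + (1 - seq) * np.log(1 - p))

    print(f"unigram cross-entropy: {unigram_loss:.4f} nats/char")
    print(f"optimal cross-entropy: {optimal_loss:.4f} nats/char")

On typical draws the unigram loss is visibly larger than the optimal loss; that gap is what the paper reports untokenized transformers failing to close and tokenized transformers closing near-optimally.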
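The Pseudocode row points to the paper's Algorithm 1, a sequential implementation of BPE. The paper's own pseudocode is the reference; the sketch below is only a minimal, textbook-style greedy merge loop (function and variable names are ours) showing the shape of such a sequential procedure: repeatedly count adjacent token pairs, merge the most frequent pair everywhere, and record the merge.

    """Minimal, textbook-style sequential BPE merge loop (illustrative only; see
    the paper's Algorithm 1 for the exact procedure it analyzes)."""
    from collections import Counter

    def learn_bpe(seq, num_merges):
        """Greedily merge the most frequent adjacent pair, num_merges times."""
        seq = list(seq)
        merges = []                                   # ordered list of learned merges
        for _ in range(num_merges):
            pairs = Counter(zip(seq, seq[1:]))        # count adjacent token pairs
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
            merges.append((a, b))
            merged, i = [], 0
            while i < len(seq):                       # replace every occurrence of (a, b)
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    merged.append(a + b)              # new token = concatenation of the pair
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            seq = merged
        return merges, seq

    merges, tokenized = learn_bpe("000110011000110011", num_merges=4)
    print(merges)      # the ordered merge rules (the learned dictionary)
    print(tokenized)   # the input retokenized with those rules

learn_bpe returns the ordered merge list (the learned dictionary) together with the retokenized sequence; applying the same merges in order tokenizes new text.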
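Finally, the Experiment Setup row can be read as a configuration object. The sketch below collects Table 3's reported values and grid-searched ranges into a Python dict and enumerates the resulting grid; the dict layout and key names are ours, and only the values come from the table (the AdamW betas are the ones quoted in the Software Dependencies row).

    """Table 3's hyperparameters as a configuration sketch (dict layout and key
    names are ours; the values are the ones reported in the table above)."""
    from itertools import product

    fixed = {
        "optimizer": "AdamW (beta1=0.9, beta2=0.95)",
        "learning_rate": 2e-3,
        "scheduler": "cosine",
        "iterations": 8000,
        "weight_decay": 1e-3,
        "dropout": 0.0,
        "sequence_length": 512,
    }

    grid = {
        "batch_size": [8, 16, 32],
        "embedding_dim": [10, 20, 30, 40],
        "n_layers": [1, 2, 4, 8],
        "n_heads": [1, 2, 4, 8, 16],
    }

    # Enumerate every grid-searched combination: 3 * 4 * 4 * 5 = 240 configurations.
    configs = [dict(zip(grid, values), **fixed) for values in product(*grid.values())]
    print(len(configs), "candidate configurations")

Whether the paper swept the full cross product or a subset of these combinations is not stated in the row above; the enumeration is shown only to make the search space explicit.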