An Analysis of Tokenization: Transformers under Markov Data

Authors: Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When trained on data drawn from certain simple kth-order Markov processes for k > 1, transformers exhibit a surprising phenomenon: in the absence of tokenization, they empirically are incredibly slow or fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. (A minimal sketch of such a source, and of the unigram vs. optimal cross-entropy gap, appears after this table.)
Researcher Affiliation | Academia | Nived Rajaraman, UC Berkeley (nived.rajaraman@berkeley.edu); Jiantao Jiao, UC Berkeley (jiantao@berkeley.edu); Kannan Ramchandran, UC Berkeley.
Pseudocode | Yes | Algorithm 1: Sequential implementation of BPE. (A minimal BPE-style merge loop is sketched after this table.)
Open Source Code | Yes | NeurIPS checklist question: "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?" Answer: [Yes]. Justification: instructions are provided in the Jupyter notebook.
Open Datasets | Yes | The Wikitext-103-raw-v1 dataset (Merity et al., 2016) and the GLUE dataset (Wang et al., 2019).
Dataset Splits | No | The paper mentions "test loss" and "validation loss evaluations" but does not specify the exact percentages or counts for the training, validation, and test splits. It implies the use of a validation set but lacks explicit details on its proportion or how it was separated.
Hardware Specification | Yes | "we train the transformers on a single GPU on an 8 A100 node."
Software Dependencies | No | Table 3 lists "Optimizer: AdamW (β1 = 0.9, β2 = 0.95)" but does not provide specific version numbers for software dependencies such as PyTorch, Python, or CUDA.
Experiment Setup | Yes | Table 3 hyperparameters: batch size grid-searched in {8, 16, 32}; learning rate 0.002; cosine scheduler; 8000 iterations; weight decay 1e-3; dropout 0; sequence length 512; embedding dimension grid-searched in {10, 20, 30, 40}; number of layers grid-searched in {1, 2, 4, 8}; number of heads grid-searched in {1, 2, 4, 8, 16}. (These values are collected into a configuration sketch after this table.)
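
To make the Research Type row concrete, the following is a minimal, illustrative Python sketch that is not taken from the paper: it samples a binary order-k Markov source with an arbitrary random transition table and compares the per-character cross-entropy of a unigram (marginal-frequency) predictor with that of the true conditional model, which is the loss an optimal sequence model would approach. The order k = 2, the sequence length, and the transition probabilities are assumptions chosen for illustration only.

    """Illustrative sketch (not from the paper): sample a binary order-k Markov
    source and compare a unigram predictor's cross-entropy with the loss of the
    true conditional model, which an optimal sequence model would approach."""
    import numpy as np

    rng = np.random.default_rng(0)
    k = 2            # Markov order (assumed here; the paper considers k > 1)
    n = 200_000      # length of the sampled sequence (assumption)

    # For each of the 2^k contexts, P(next symbol = 1 | context), drawn arbitrarily.
    p_one = rng.uniform(0.1, 0.9, size=2 ** k)

    seq = np.zeros(n, dtype=np.int64)    # sampled symbols
    ctxs = np.zeros(n, dtype=np.int64)   # context (last k symbols, encoded as an int) before each symbol
    ctx = 0
    for t in range(n):
        ctxs[t] = ctx
        x = int(rng.random() < p_one[ctx])
        seq[t] = x
        ctx = ((ctx << 1) | x) & (2 ** k - 1)   # slide the k-symbol window

    # Unigram model: predict every symbol from the marginal frequency of 1s.
    q = seq.mean()
    unigram_loss = -(q * np.log(q) + (1 - q) * np.log(1 - q))

    # Optimal predictor: score every symbol with the true conditional probability.
    p = p_one[ctxs]
    optimal_loss = -np.mean(seq * np.log(p) + (1 - seq) * np.log(1 - p))

    print(f"unigram cross-entropy: {unigram_loss:.4f} nats/char")
    print(f"optimal cross-entropy: {optimal_loss:.4f} nats/char")

On typical draws the unigram loss is visibly larger than the optimal loss; that gap is what the paper reports untokenized transformers failing to close and tokenized transformers closing near-optimally.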
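The Pseudocode row points to the paper's Algorithm 1, a sequential implementation of BPE. The paper's own pseudocode is the reference; the sketch below is only a minimal, textbook-style greedy merge loop (function and variable names are ours) showing the shape of such a sequential procedure: repeatedly count adjacent token pairs, merge the most frequent pair everywhere, and record the merge.

    """Minimal, textbook-style sequential BPE merge loop (illustrative only; see
    the paper's Algorithm 1 for the exact procedure it analyzes)."""
    from collections import Counter

    def learn_bpe(seq, num_merges):
        """Greedily merge the most frequent adjacent pair, num_merges times."""
        seq = list(seq)
        merges = []                                   # ordered list of learned merges
        for _ in range(num_merges):
            pairs = Counter(zip(seq, seq[1:]))        # count adjacent token pairs
            if not pairs:
                break
            (a, b), _count = pairs.most_common(1)[0]  # most frequent pair
            merges.append((a, b))
            merged, i = [], 0
            while i < len(seq):                       # replace every occurrence of (a, b)
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    merged.append(a + b)              # new token = concatenation of the pair
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            seq = merged
        return merges, seq

    merges, tokenized = learn_bpe("000110011000110011", num_merges=4)
    print(merges)      # the ordered merge rules (the learned dictionary)
    print(tokenized)   # the input retokenized with those rules

learn_bpe returns the ordered merge list (the learned dictionary) together with the retokenized sequence; applying the same merges in order tokenizes new text.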
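Finally, the Experiment Setup row can be read as a configuration object. The sketch below collects Table 3's reported values and grid-searched ranges into a Python dict and enumerates the resulting grid; the dict layout and key names are ours, and only the values come from the table (the AdamW betas are the ones quoted in the Software Dependencies row).

    """Table 3's hyperparameters as a configuration sketch (dict layout and key
    names are ours; the values are the ones reported in the table above)."""
    from itertools import product

    fixed = {
        "optimizer": "AdamW (beta1=0.9, beta2=0.95)",
        "learning_rate": 2e-3,
        "scheduler": "cosine",
        "iterations": 8000,
        "weight_decay": 1e-3,
        "dropout": 0.0,
        "sequence_length": 512,
    }

    grid = {
        "batch_size": [8, 16, 32],
        "embedding_dim": [10, 20, 30, 40],
        "n_layers": [1, 2, 4, 8],
        "n_heads": [1, 2, 4, 8, 16],
    }

    # Enumerate every grid-searched combination: 3 * 4 * 4 * 5 = 240 configurations.
    configs = [dict(zip(grid, values), **fixed) for values in product(*grid.values())]
    print(len(configs), "candidate configurations")

Whether the paper swept the full cross product or a subset of these combinations is not stated in the row above; the enumeration is shown only to make the search space explicit.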