An Analysis of Tokenization: Transformers under Markov Data
Authors: Nived Rajaraman, Jiantao Jiao, Kannan Ramchandran
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When trained on data drawn from certain simple kth-order Markov processes for k > 1, transformers exhibit a surprising phenomenon: in the absence of tokenization, they empirically are incredibly slow or fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. (A worked sketch of this Markov setting follows the table.) |
| Researcher Affiliation | Academia | Nived Rajaraman, UC Berkeley (nived.rajaraman@berkeley.edu); Jiantao Jiao, UC Berkeley (jiantao@berkeley.edu); Kannan Ramchandran, UC Berkeley |
| Pseudocode | Yes | Algorithm 1: Sequential implementation of BPE (a BPE sketch follows the table) |
| Open Source Code | Yes | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Instructions provided in the jupyter notebook. |
| Open Datasets | Yes | Wikitext-103-raw-v1 dataset (Merity et al., 2016) and GLUE dataset (Wang et al., 2019). |
| Dataset Splits | No | The paper mentions 'test loss' and 'validation loss evaluations' but does not specify the exact percentages or counts for training, validation, and test splits. It implies the use of a validation set but lacks explicit details on its proportion or how it was separated. |
| Hardware Specification | Yes | we train the transformers on a single GPU on an 8 A100 node. |
| Software Dependencies | No | Table 3 lists 'Optimizer AdamW (β1 = 0.9, β2 = 0.95)' but does not provide specific version numbers for software dependencies such as PyTorch, Python, or CUDA. |
| Experiment Setup | Yes | Table 3: Hyperparameter choices: Batch size: grid-searched in {8, 16, 32}; Learning rate: 0.002; Scheduler: cosine; # Iterations: 8000; Weight decay: 1e-3; Dropout: 0; Sequence length: 512; Embedding dimension: grid-searched in {10, 20, 30, 40}; # layers: grid-searched in {1, 2, 4, 8}; # heads: grid-searched in {1, 2, 4, 8, 16}. (A grid-search sketch follows the table.) |
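
To make the Research Type row concrete, here is a minimal sketch (not the authors' code) that samples a k-th order binary Markov source and compares the cross-entropy of a context-free unigram model with the source's optimal conditional loss; the gap between the two is what the paper attributes tokenization with closing. The transition probabilities below are arbitrary illustrative assumptions, not the specific switching processes studied in the paper.

```python
# A minimal sketch: sample a k-th order binary Markov source and compare the
# cross-entropy of a unigram (context-free) model with the source's optimal
# conditional loss. Transition probabilities are illustrative assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(0)
k = 2  # order of the Markov process

# Assumed kernel: P(next bit = 1 | last k bits), chosen arbitrarily.
p_one = {ctx: rng.uniform(0.1, 0.9) for ctx in itertools.product([0, 1], repeat=k)}

def sample(n):
    seq = list(rng.integers(0, 2, size=k))          # random initial context
    for _ in range(n - k):
        ctx = tuple(seq[-k:])
        seq.append(int(rng.random() < p_one[ctx]))
    return np.array(seq)

x = sample(100_000)

# Unigram model: the best loss achievable if the model ignores context.
p1 = x.mean()
unigram_ce = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))

# Empirical conditional loss under the true kernel: the optimal cross-entropy.
ce_opt = 0.0
for i in range(k, len(x)):
    p = p_one[tuple(x[i - k:i])]
    ce_opt -= np.log(p) if x[i] == 1 else np.log(1 - p)
ce_opt /= len(x) - k

print(f"unigram cross-entropy ~ {unigram_ce:.4f} nats")
print(f"optimal cross-entropy ~ {ce_opt:.4f} nats")
```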
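
The Pseudocode row refers to Algorithm 1, a sequential implementation of BPE. The sketch below is the textbook greedy BPE merge loop, written here only for illustration; it is not the paper's Algorithm 1 verbatim, and the sequential variant in the paper may differ in how merges are counted and applied.

```python
# A minimal sketch of byte-pair encoding (BPE) dictionary learning; not the
# paper's Algorithm 1 verbatim.
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Greedily merge the most frequent adjacent token pair `num_merges` times."""
    tokens = list(corpus)              # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):         # replace every occurrence of the pair
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, encoded = learn_bpe("abababbbabab", num_merges=3)
print(merges)    # learned merge rules, e.g. [('a', 'b'), ('ab', 'ab'), ...]
print(encoded)   # corpus rewritten with the merged tokens
```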
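
For the Experiment Setup row, the following sketch enumerates the grid implied by Table 3. The value lists are copied from the table; `train_and_eval` is a hypothetical placeholder for the authors' training loop, not a function from their code.

```python
# A minimal sketch of the grid search implied by Table 3. Value lists come from
# the table; `train_and_eval` is a hypothetical stand-in for the training loop.
from itertools import product

grid = {
    "batch_size": [8, 16, 32],
    "embedding_dim": [10, 20, 30, 40],
    "n_layers": [1, 2, 4, 8],
    "n_heads": [1, 2, 4, 8, 16],
}
fixed = {
    "lr": 2e-3, "scheduler": "cosine", "iterations": 8000,
    "weight_decay": 1e-3, "dropout": 0.0, "seq_len": 512,
    "betas": (0.9, 0.95),  # AdamW betas from Table 3
}

def train_and_eval(config):
    # Placeholder: would train a transformer with `config` and return test loss.
    return 0.0

best = None
for values in product(*grid.values()):
    config = {**fixed, **dict(zip(grid.keys(), values))}
    loss = train_and_eval(config)
    if best is None or loss < best[0]:
        best = (loss, config)
print("best config:", best[1])
```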