Enhancing Large Language Models through Adaptive Tokenizers
Authors: Mengyu Zheng, Hanting Chen, Tianyu Guo, Chong Zhu, Binfan Zheng, Chang Xu, Yunhe Wang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we outline the comprehensive experimental framework designed to assess the effectiveness of our proposed tokenizer, Adaptive Tokenizer (ADAT), in comparison to established methods such as Byte Pair Encoding (BPE) [32] and the Unigram model [19]. These evaluations utilize the Pythia [3] suite of models at various scales, leveraging a substantial corpus to ensure robust and generalizable results. |
| Researcher Affiliation | Collaboration | Mengyu Zheng (The University of Sydney; Huawei Noah's Ark Lab), mzhe4259@uni.sydney.edu.au; Hanting Chen (Huawei Noah's Ark Lab), chenhanting@huawei.com; Tianyu Guo (Huawei Noah's Ark Lab), tianyu.guo@huawei.com; Chong Zhu (Huawei Noah's Ark Lab), zhuchong4@huawei.com; Binfan Zheng (Huawei GTS AI Computing Lab), zhengbinfan1@huawei.com; Chang Xu (The University of Sydney), c.xu@sydney.edu.au; Yunhe Wang (Huawei Noah's Ark Lab), yunhe.wang@huawei.com |
| Pseudocode | No | The paper includes a figure illustrating the pipeline but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | No | The paper states that the dataset list is included in the appendix and that the authors will release the code after completing the necessary preparations; no repository is provided. |
| Open Datasets | Yes | The study uses a corpus extracted from The Pile [14], consisting of 56 GB of raw data across 91 files; subsets from DM_Mathematics and Github were excluded to ensure the relevance and quality of the data. The remaining data, approximately 16 billion tokens after a random shuffle, was tokenized with a Unigram [19] tokenizer with a vocabulary of 50,000 tokens (see the tokenizer-training sketch below the table). A detailed enumeration of the data files used is given in Supp. A.8. |
| Dataset Splits | No | The paper mentions training data and test sets but does not explicitly describe validation splits or procedures. For example, it states, “We calculate PPL for all models on PG19 [29] dataset. Specifically, we use its test set and the first 2048 tokens for each book.” without specifying a validation split for training (a sketch of this perplexity protocol follows the table). |
| Hardware Specification | Yes | To measure runtime, we used 8 NVIDIA A100 GPUs, an Intel 8378A CPU, and PyTorch 2.1.2 with CUDA 12.1. |
| Software Dependencies | Yes | To measure runtime, we used 8 NVIDIA A100 GPUs, an Intel 8378A CPU, and PyTorch 2.1.2 with CUDA 12.1. |
| Experiment Setup | Yes | The Pythia-70M model is selected for its moderate size and efficiency, which mitigate the complexities associated with larger model architectures. It is initialized with random weights and undergoes a single training epoch on the pre-training data. This data is processed with vocabularies generated from 1/10th of the training corpus (approximately 1.5 billion tokens), each containing 50,000 tokens, a size consistent with the Pythia [3] setup (see the initialization sketch below the table). |
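
The Unigram tokenization noted in the Open Datasets row can be sketched with SentencePiece. This is a minimal sketch rather than the authors' pipeline: the input path, `model_prefix`, and sub-sampling flags are assumptions; only the model type (Unigram) and the 50,000-token vocabulary come from the paper.

```python
# Minimal sketch: training a Unigram tokenizer with SentencePiece.
# The vocab size (50,000) and model type follow the paper; the input
# path and remaining flags are placeholder assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="pile_subset.txt",        # hypothetical plain-text dump of the Pile subset
    model_prefix="unigram_50k",     # writes unigram_50k.model / unigram_50k.vocab
    model_type="unigram",
    vocab_size=50_000,
    input_sentence_size=10_000_000, # subsample a very large corpus for training
    shuffle_input_sentence=True,
)

# Tokenize with the trained model.
sp = spm.SentencePieceProcessor(model_file="unigram_50k.model")
ids = sp.encode("Adaptive tokenizers adjust segmentation to the data.", out_type=int)
```

Training directly on a 56 GB corpus would in practice require sentence sub-sampling (as hinted by `input_sentence_size`) or sharding the input, a detail the paper does not spell out.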
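For the PG19 perplexity protocol quoted in the Dataset Splits row (test set, first 2048 tokens per book), the computation can be outlined as below. The model and dataset identifiers and the unbatched loop are assumptions; the paper only fixes the dataset, the split, and the 2048-token window.

```python
# Sketch: per-book perplexity on the first 2048 tokens of PG19's test set.
# Model/dataset identifiers and the unbatched loop are assumptions.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"   # stand-in; swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

pg19_test = load_dataset("pg19", split="test")  # assumed Hub identifier for PG19

ppls = []
for book in pg19_test:
    ids = tokenizer(book["text"], return_tensors="pt").input_ids[:, :2048]
    with torch.no_grad():
        loss = model(ids, labels=ids).loss      # mean cross-entropy over the window
    ppls.append(math.exp(loss.item()))

print(f"mean PPL over {len(ppls)} books: {sum(ppls) / len(ppls):.2f}")
```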
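The Experiment Setup row describes training Pythia-70M from random initialization for a single epoch. A minimal sketch of that initialization with Hugging Face Transformers follows; it reuses only the published Pythia-70M architecture config, the 50,000-entry vocabulary is taken from the paper, and the training loop itself is omitted because its details are not given here.

```python
# Sketch: build a randomly initialized Pythia-70M (GPT-NeoX architecture)
# from its published config, rather than loading pretrained weights.
from transformers import AutoConfig, GPTNeoXForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/pythia-70m")
config.vocab_size = 50_000          # match the 50k-token tokenizer from the paper
model = GPTNeoXForCausalLM(config)  # random init; no pretrained weights are loaded

n_params = sum(p.numel() for p in model.parameters())
print(f"randomly initialized Pythia-70M-style model with {n_params / 1e6:.1f}M parameters")
```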