Zero-Shot Tokenizer Transfer

Authors: Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original model's performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence.
Researcher Affiliation | Academia | Benjamin Minixhofer (University of Cambridge), Edoardo M. Ponti (University of Edinburgh), Ivan Vulić (University of Cambridge)
Pseudocode | Yes | Algorithm 1: Hypernetwork training loop for Zero-Shot Tokenizer Transfer (a training-loop sketch follows the table below)
Open Source Code | No | Code is not submitted alongside this paper but will be provided upon publication.
Open Datasets | Yes | We use the English subset of the MADLAD-400 corpus (Kudugunta et al., 2023) and code from the StarCoder data (Li et al., 2023) for hypernetwork training. For the n-shot experiments, we also train on the StarCoder data, but substitute the English section of the MADLAD-400 corpus for Flan v2 (Longpre et al., 2023) sampled as in Soldaini et al. (2024). (A data-loading sketch follows the table below.)
Dataset Splits | No | The paper uses established datasets for training (MADLAD-400, StarCoder, Flan v2) and evaluation (PIQA, HellaSwag, ARC, BoolQ, MMLU, HumanEvalPack, XNLI, XCOPA, multilingual MMLU), but it does not explicitly state the training, validation, or test splits used for its own experiments.
Hardware Specification | Yes | Training takes around one day for the XLM-R hypernetwork on a TPU v3-8 and three days for the Mistral-7B hypernetwork on a TPU v4-32 pod.
Software Dependencies | Yes | In practice, we use the CPLEX v22.1 (IBM ILOG, 2022) solver.
Experiment Setup | Yes | Appendix D, Additional Hyperparameters; Table 9: Hypernetwork hyperparameters. Optimizer: AdamW (Loshchilov & Hutter, 2019); (β1, β2) = (0.9, 0.95); weight decay 0.01; max. global gradient norm 0.1; sequence length 128; batch size 128; 200,000 steps, of which 10,000 are MIMICK-style warmup steps; MIMICK-style warmup learning rate schedule: linear warmup to 3e-4; main learning rate schedule: linear warmup to 6e-5 until 10k steps, then cosine decay to 6e-6. Tokenizer sampling: vocabulary size 32,768; distribution of noise level z: µ = ln(10^5), σ = 4; batch size m = 2048; auxiliary loss weight 0.5. Hypernetwork: 3 layers; max. sequence length 7 (English + Code) or 15 (multilingual); hidden dimension d_model; FFN dimension 2·d_model; num. attention heads min(d_model/64, 32). (An optimizer/schedule sketch follows the table below.)
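
The Pseudocode row above refers to Algorithm 1, the hypernetwork training loop. Below is a minimal PyTorch-style sketch of that loop as described by the paper: sample a tokenizer, predict its embedding matrix with the hypernetwork, run the language model with the predicted embeddings, and take a gradient step. Everything toy-sized here is an assumption, not the authors' code: the GRU stands in for the LM body, the MLP stands in for the hypernetwork transformer, output embeddings are tied to input embeddings for brevity, "tokenizer sampling" is reduced to a random decomposition, and the exact form of the auxiliary loss is an illustrative guess.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_ORIG, VOCAB_NEW, D_MODEL = 1000, 256, 64   # toy sizes, not the paper's

class Hypernetwork(nn.Module):
    """Predicts an embedding for each token of a new tokenizer from the
    embeddings of the original-tokenizer pieces it decomposes into.
    (The paper uses a small transformer here; an MLP over the mean keeps
    this sketch short.)"""
    def __init__(self, d_model: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.GELU(),
                                 nn.Linear(2 * d_model, d_model))

    def forward(self, piece_embs: torch.Tensor) -> torch.Tensor:
        # piece_embs: (new_vocab, n_pieces, d_model) -> (new_vocab, d_model)
        return self.mlp(piece_embs.mean(dim=1))

orig_emb = nn.Embedding(VOCAB_ORIG, D_MODEL)           # frozen base-model embeddings
orig_emb.weight.requires_grad_(False)
lm_body = nn.GRU(D_MODEL, D_MODEL, batch_first=True)   # toy stand-in for the LM body
hypernet = Hypernetwork(D_MODEL)
opt = torch.optim.AdamW(hypernet.parameters(), lr=6e-5,
                        betas=(0.9, 0.95), weight_decay=0.01)

for step in range(3):
    # 1) Sample a tokenizer; here just a random decomposition of each new token
    #    into 4 original-vocabulary pieces (the paper samples UnigramLM tokenizers).
    pieces = torch.randint(0, VOCAB_ORIG, (VOCAB_NEW, 4))
    # 2) Predict the new embedding matrix with the hypernetwork.
    new_embs = hypernet(orig_emb(pieces))                       # (VOCAB_NEW, D_MODEL)
    # 3) Run the LM with the predicted embeddings on a batch tokenized with the
    #    sampled tokenizer and compute the usual next-token loss.
    batch = torch.randint(0, VOCAB_NEW, (8, 16))
    hidden, _ = lm_body(new_embs[batch])
    logits = hidden @ new_embs.T                                # tied output embeddings
    lm_loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB_NEW),
                              batch[:, 1:].reshape(-1))
    # 4) Auxiliary loss (weight 0.5 in Table 9). Its form here -- pulling predicted
    #    embeddings toward the original ones -- is an illustrative assumption.
    aux_loss = F.mse_loss(new_embs, orig_emb.weight[:VOCAB_NEW])
    loss = lm_loss + 0.5 * aux_loss
    # 5) Gradient step with the Table 9 max global gradient norm of 0.1.
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(hypernet.parameters(), 0.1)
    opt.step()
```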
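
The Open Datasets row lists the hypernetwork training corpora. A minimal sketch of streaming and mixing them with the Hugging Face `datasets` library is below; the dataset IDs `allenai/MADLAD-400` and `bigcode/starcoderdata`, the config/split names, and the 50/50 mixing ratio are assumptions for illustration, not details taken from the paper.

```python
from datasets import load_dataset, interleave_datasets

# English subset of MADLAD-400 and the StarCoder data, streamed to avoid a full
# download. Dataset IDs, config names, and splits are assumptions; check the
# dataset cards on the Hub for the exact values.
madlad_en = load_dataset("allenai/MADLAD-400", "en", split="clean", streaming=True)
starcoder = load_dataset("bigcode/starcoderdata", data_dir="python",
                         split="train", streaming=True)

# Interleave the two sources. The paper does not state a mixing ratio, so a
# 50/50 mix is used here purely for illustration.
mixed = interleave_datasets([madlad_en, starcoder],
                            probabilities=[0.5, 0.5], seed=0)

for i, example in enumerate(mixed):
    if i >= 3:
        break
    print(list(example.keys()))   # column names differ between the two sources
```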
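
The Experiment Setup row summarizes Table 9. Below is a short sketch of the optimizer and learning-rate schedule it describes (AdamW with betas (0.9, 0.95), weight decay 0.01, linear warmup to 6e-5 over 10k steps, cosine decay to 6e-6 over the remaining steps, gradient clipping at 0.1). The `model` placeholder and the schedule implementation are one straightforward reading of the table, not the authors' code.

```python
import math
import torch

TOTAL_STEPS, WARMUP_STEPS = 200_000, 10_000
PEAK_LR, FINAL_LR = 6e-5, 6e-6

model = torch.nn.Linear(8, 8)   # placeholder for the hypernetwork parameters

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR,
                              betas=(0.9, 0.95), weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warmup to the peak LR over 10k steps, then cosine decay to 6e-6."""
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    floor = FINAL_LR / PEAK_LR
    return floor + (1.0 - floor) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per training step: clip gradients to the Table 9 max norm, then step both.
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
#   optimizer.step(); scheduler.step()
```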