Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

Authors: Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that CHARFORMER outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models."
Researcher Affiliation | Industry | "Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler. Google Research and DeepMind. yitay@google.com, vqtran@google.com"
Pseudocode | Yes | "For additional clarity, we include a simplified implementation of the GBST module in Tensorflow below." (A hedged sketch of such a module is included after this table.)
Open Source Code | Yes | "All code to train the core byte-level Transformer encoder-decoder for CHARFORMER and its variants is already open-sourced as a part of the Mesh Tensorflow (Shazeer et al., 2018), T5 (Raffel et al., 2020), and ByT5 (Xue et al., 2021) libraries. Additionally, an implementation of Charformer GBST compatible with existing open-source models has been open-sourced." https://github.com/google-research/google-research/tree/master/charformer
Open Datasets | Yes | "We evaluate on a diverse set of standard English tasks from GLUE covering sentiment classification (SST-2; Socher et al., 2013), natural language inference (MNLI, QNLI; Williams et al., 2018; Rajpurkar et al., 2016), paraphrase detection (MRPC, QQP; Dolan and Brockett, 2005) and sentence similarity (Cer et al., 2017). ... We pre-train all models on the C4 corpus... We pre-train CHARFORMER as well as the Byte-level T5 and Byte-level T5+LASC baselines on the multilingual mC4 Common Crawl corpus (Xue et al., 2020) in 101 languages."
Dataset Splits | Yes | "Checkpoints were picked based on the dev set metrics, and then evaluated on the test set."
Hardware Specification | Yes | "Each model is pre-trained on 16 TPU V3 chips." ... "All experiments were run on 16 TPU-v3 chips..."
Software Dependencies | No | The paper mentions using the Mesh Tensorflow, T5, and ByT5 libraries, but does not specify their version numbers.
Experiment Setup | Yes | "We pre-train all models on the C4 corpus for 1M steps using a batch size of 64 and sequence length of 1024. All non-subword models use a vocabulary of 256 bytes. Our pre-training scheme corrupts spans with a mean length of 20 bytes. ... We pre-train our models with the Adafactor optimizer with an inverse square root learning rate. We then fine-tune on each individual task separately using a constant learning rate of 10^-3. ... Our small model follows the T5 small model size with 6 encoder layers and 6 decoder layers, hidden size d_model of 512, 8 heads, d_kv of 32 and d_ff of 2048. ... For Charformer, the filter size of the pre-GBST convolution is set to 5 by default. For CHARFORMER, the downsampling rate is tuned in the range of {2, 3, 4}." (A sketch of the pre-training learning-rate schedule also follows the table.)
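
The Pseudocode row above refers to a simplified TensorFlow implementation of the GBST module included in the paper's appendix. The snippet below is an independent, minimal sketch of the same idea rather than the authors' released code: byte embeddings pass through a small convolution, candidate subword blocks of sizes 1 through max_block_size are formed by mean pooling, a learned scorer softly selects among block sizes at every byte position, and the mixed sequence is then downsampled. Names and defaults such as GBST, max_block_size, and downsample_rate are illustrative assumptions here.

```python
import tensorflow as tf


class GBST(tf.keras.layers.Layer):
    """Minimal sketch of Gradient-Based Subword Tokenization (GBST).

    Illustrative re-implementation, not the released Charformer code;
    hyperparameter names and defaults are assumptions.
    """

    def __init__(self, d_model, max_block_size=4, downsample_rate=2,
                 conv_filter_size=5):
        super().__init__()
        self.max_block_size = max_block_size
        self.downsample_rate = downsample_rate
        # Pre-GBST 1D convolution over byte embeddings (filter size 5 by default).
        self.conv = tf.keras.layers.Conv1D(d_model, conv_filter_size, padding="same")
        # Produces one score per candidate block representation.
        self.block_scorer = tf.keras.layers.Dense(1)

    def call(self, byte_embeddings):
        # byte_embeddings: [batch, length, d_model]
        x = self.conv(byte_embeddings)
        length = tf.shape(x)[1]
        candidates, scores = [], []
        for b in range(1, self.max_block_size + 1):
            # Mean-pool non-overlapping blocks of size b, then repeat each
            # pooled block b times so every byte position has one candidate
            # representation per block size.
            pooled = tf.nn.avg_pool1d(x, ksize=b, strides=b, padding="SAME")
            upsampled = tf.repeat(pooled, repeats=b, axis=1)[:, :length, :]
            candidates.append(upsampled)
            scores.append(self.block_scorer(upsampled))      # [batch, length, 1]
        blocks = tf.stack(candidates, axis=2)                # [batch, length, B, d_model]
        weights = tf.nn.softmax(tf.concat(scores, -1), -1)   # soft block-size selection
        mixed = tf.reduce_sum(blocks * weights[..., None], axis=2)
        # Downsample the softly tokenized sequence before the Transformer stack.
        return tf.nn.avg_pool1d(mixed, ksize=self.downsample_rate,
                                strides=self.downsample_rate, padding="SAME")


# Example: a batch of 1024 byte embeddings with d_model=512 is reduced to
# length 512 with downsample_rate=2.
gbst = GBST(d_model=512, max_block_size=4, downsample_rate=2)
latent = gbst(tf.random.normal([2, 1024, 512]))  # -> shape [2, 512, 512]
```

With the downsampling rates of {2, 3, 4} quoted in the Experiment Setup row, the Transformer stack sees a sequence two to four times shorter than the raw byte input, which is where the speed advantage over plain byte-level models comes from.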
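The Experiment Setup row quotes an Adafactor optimizer with an inverse square root learning rate for pre-training and a constant 10^-3 rate for per-task fine-tuning. The sketch below shows what such schedules typically look like; the 10,000-step warmup floor is the usual T5 default and is an assumption, since the reviewed text does not quote a warmup value.

```python
import tensorflow as tf


def pretrain_lr(step, warmup_steps=10_000):
    """Inverse square-root schedule used with Adafactor during pre-training.

    The 10k-step warmup floor is the common T5 default and is an assumption;
    the reviewed text does not quote a warmup value.
    """
    step = tf.cast(step, tf.float32)
    return tf.math.rsqrt(tf.maximum(step, float(warmup_steps)))


def finetune_lr(step):
    """Constant learning rate of 1e-3 quoted for per-task fine-tuning."""
    return 1e-3
```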