Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
Authors: Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that CHARFORMER outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. |
| Researcher Affiliation | Industry | Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler. Google Research and DeepMind. yitay@google.com, vqtran@google.com |
| Pseudocode | Yes | For additional clarity, we include a simplified implementation of the GBST module in Tensorflow below. (A hedged GBST sketch also appears after this table.) |
| Open Source Code | Yes | All code to train the core byte-level Transformer encoder-decoder for CHARFORMER and its variants is already open-sourced as a part of the Mesh Tensorflow (Shazeer et al., 2018), T5 (Raffel et al., 2020), and ByT5 (Xue et al., 2021) libraries. Additionally, an implementation of Charformer GBST compatible with existing open-source models has been open-sourced. https://github.com/google-research/google-research/tree/master/charformer |
| Open Datasets | Yes | We evaluate on a diverse set of standard English tasks from GLUE covering sentiment classification (SST-2; Socher et al., 2013), natural language inference (MNLI, QNLI; Williams et al., 2018; Rajpurkar et al., 2016), paraphrase detection (MRPC, QQP; Dolan and Brockett, 2005) and sentence similarity (Cer et al., 2017). ...We pre-train all models on the C4 corpus... We pre-train CHARFORMER as well as the Byte-level T5 and Byte-level T5+LASC baselines on multilingual mC4 Common Crawl (Xue et al., 2020) in 101 languages. |
| Dataset Splits | Yes | Checkpoints were picked based on the dev set metrics, and then evaluated on the test set. |
| Hardware Specification | Yes | Each model is pre-trained on 16 TPU V3 chips. All experiments were run on 16 TPU-v3 chips... |
| Software Dependencies | No | The paper mentions using the 'Mesh Tensorflow', 'T5', and 'ByT5' libraries, but does not specify their version numbers. |
| Experiment Setup | Yes | We pre-train all models on the C4 corpus for 1M steps using a batch size of 64 and sequence length of 1024. All non-subword models use a vocabulary of 256 bytes. Our pre-training scheme corrupts spans with a mean length of 20 bytes. ... We pre-train our models with the Adafactor optimizer with an inverse square root learning rate. We then fine-tune on each individual task separately using a constant learning rate of 10^-3. ...Our small model follows the T5 small model size with 6 encoder layers and 6 decoder layers, hidden size d_model of 512, 8 heads, d_kv of 32 and d_ff of 2048. ... For Charformer, the filter size of the pre-GBST convolution is set to 5 by default. For CHARFORMER, the downsampling rate is tuned in the range of {2, 3, 4}. (A sketch of the inverse square root schedule appears after this table.) |
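
For readers who want a concrete picture of the GBST module referenced in the Pseudocode row, the following is a minimal sketch in TensorFlow 2 of the core idea described in the paper: pool candidate subword blocks at several block sizes, score each candidate with a learned scorer, take a soft (differentiable) mixture over block sizes at every byte position, and then downsample the sequence. It is written from the paper's description, not taken from the released code; the class name `GBSTSketch` and its default hyperparameters are illustrative assumptions, and details such as the pre-GBST convolution and score calibration are omitted.

```python
import tensorflow as tf


class GBSTSketch(tf.keras.layers.Layer):
    """Simplified gradient-based subword tokenization (illustrative only)."""

    def __init__(self, max_block_size=4, downsample_rate=2):
        super().__init__()
        self.max_block_size = max_block_size
        self.downsample_rate = downsample_rate
        # One scalar score per candidate block representation.
        self.scorer = tf.keras.layers.Dense(1)

    def call(self, char_embeddings):
        # char_embeddings: [batch, length, d_model] byte/character embeddings.
        length = tf.shape(char_embeddings)[1]
        candidates = []
        for block_size in range(1, self.max_block_size + 1):
            # Mean-pool non-overlapping blocks of this size, then repeat each
            # pooled vector so every position has one candidate per block size.
            pooled = tf.nn.avg_pool1d(
                char_embeddings, ksize=block_size, strides=block_size,
                padding="SAME")
            upsampled = tf.repeat(pooled, repeats=block_size, axis=1)
            candidates.append(upsampled[:, :length, :])
        # [batch, length, num_block_sizes, d_model]
        stacked = tf.stack(candidates, axis=2)
        # Softmax over block sizes: a soft, differentiable choice of subword
        # granularity at every position.
        scores = tf.nn.softmax(
            tf.squeeze(self.scorer(stacked), axis=-1), axis=-1)
        latent = tf.reduce_sum(stacked * scores[..., None], axis=2)
        # Mean-pool to downsample the sequence before the Transformer stack.
        return tf.nn.avg_pool1d(
            latent, ksize=self.downsample_rate,
            strides=self.downsample_rate, padding="SAME")
```

With a downsample rate of 2, an input of shape [2, 1024, 512] becomes [2, 512, 512]; this sequence-length reduction before the Transformer stack is what gives Charformer its speed advantage over plain byte-level baselines.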
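
The Experiment Setup row mentions pre-training with the Adafactor optimizer and an inverse square root learning rate. As a point of reference, that schedule can be expressed as a Keras `LearningRateSchedule`; the sketch below is an assumption based on common T5 practice, and the warmup value of 10,000 steps is not a number reported in the excerpt above.

```python
import tensorflow as tf


class InverseSquareRootSchedule(
        tf.keras.optimizers.schedules.LearningRateSchedule):
    """lr(step) = 1 / sqrt(max(step, warmup_steps))."""

    def __init__(self, warmup_steps=10_000):
        super().__init__()
        self.warmup_steps = float(warmup_steps)

    def __call__(self, step):
        # Constant at 1/sqrt(warmup_steps) during warmup, then decays with the
        # inverse square root of the training step.
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(tf.maximum(step, self.warmup_steps))
```

Fine-tuning, by contrast, uses a constant learning rate of 10^-3 according to the setup quoted above.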