Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

Authors: Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that CHARFORMER outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models."
Researcher Affiliation | Industry | "Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler. Google Research and DeepMind. yitay@google.com, vqtran@google.com"
Pseudocode | Yes | "For additional clarity, we include a simplified implementation of the GBST module in Tensorflow below." (A hedged sketch of such a module is included after this table.)
Open Source Code | Yes | "All code to train the core byte-level Transformer encoder-decoder for CHARFORMER and its variants is already open-sourced as a part of the Mesh Tensorflow (Shazeer et al., 2018), T5 (Raffel et al., 2020), and ByT5 (Xue et al., 2021) libraries. Additionally, an implementation of Charformer GBST compatible with existing open-source models has been open-sourced." https://github.com/google-research/google-research/tree/master/charformer
Open Datasets | Yes | "We evaluate on a diverse set of standard English tasks from GLUE covering sentiment classification (SST-2; Socher et al., 2013), natural language inference (MNLI, QNLI; Williams et al., 2018; Rajpurkar et al., 2016), paraphrase detection (MRPC, QQP; Dolan and Brockett, 2005) and sentence similarity (Cer et al., 2017). ... We pre-train all models on the C4 corpus... We pre-train CHARFORMER as well as the Byte-level T5 and Byte-level T5+LASC baselines on the multilingual mC4 Common Crawl corpus (Xue et al., 2020) in 101 languages."
Dataset Splits | Yes | "Checkpoints were picked based on the dev set metrics, and then evaluated on the test set."
Hardware Specification | Yes | "Each model is pre-trained on 16 TPU V3 chips." ... "All experiments were run on 16 TPU-v3 chips..."
Software Dependencies | No | The paper mentions using the Mesh Tensorflow, T5, and ByT5 libraries, but does not specify their version numbers.
Experiment Setup | Yes | "We pre-train all models on the C4 corpus for 1M steps using a batch size of 64 and sequence length of 1024. All non-subword models use a vocabulary of 256 bytes. Our pre-training scheme corrupts spans with a mean length of 20 bytes. ... We pre-train our models with the Adafactor optimizer with an inverse square root learning rate. We then fine-tune on each individual task separately using a constant learning rate of 10^-3. ... Our small model follows the T5 small model size with 6 encoder layers and 6 decoder layers, hidden size d_model of 512, 8 heads, d_kv of 32 and d_ff of 2048. ... For Charformer, the filter size of the pre-GBST convolution is set to 5 by default. For CHARFORMER, the downsampling rate is tuned in the range of {2, 3, 4}." (A sketch of the pre-training learning-rate schedule also follows the table.)
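
The Pseudocode row above refers to a simplified TensorFlow implementation of the GBST module included in the paper's appendix. The snippet below is an independent, minimal sketch of the same idea rather than the authors' released code: byte embeddings pass through a small convolution, candidate subword blocks of sizes 1 through max_block_size are formed by mean pooling, a learned scorer softly selects among block sizes at every byte position, and the mixed sequence is then downsampled. Names and defaults such as GBST, max_block_size, and downsample_rate are illustrative assumptions here.

```python
import tensorflow as tf


class GBST(tf.keras.layers.Layer):
    """Minimal sketch of Gradient-Based Subword Tokenization (GBST).

    Illustrative re-implementation, not the released Charformer code;
    hyperparameter names and defaults are assumptions.
    """

    def __init__(self, d_model, max_block_size=4, downsample_rate=2,
                 conv_filter_size=5):
        super().__init__()
        self.max_block_size = max_block_size
        self.downsample_rate = downsample_rate
        # Pre-GBST 1D convolution over byte embeddings (filter size 5 by default).
        self.conv = tf.keras.layers.Conv1D(d_model, conv_filter_size, padding="same")
        # Produces one score per candidate block representation.
        self.block_scorer = tf.keras.layers.Dense(1)

    def call(self, byte_embeddings):
        # byte_embeddings: [batch, length, d_model]
        x = self.conv(byte_embeddings)
        length = tf.shape(x)[1]
        candidates, scores = [], []
        for b in range(1, self.max_block_size + 1):
            # Mean-pool non-overlapping blocks of size b, then repeat each
            # pooled block b times so every byte position has one candidate
            # representation per block size.
            pooled = tf.nn.avg_pool1d(x, ksize=b, strides=b, padding="SAME")
            upsampled = tf.repeat(pooled, repeats=b, axis=1)[:, :length, :]
            candidates.append(upsampled)
            scores.append(self.block_scorer(upsampled))      # [batch, length, 1]
        blocks = tf.stack(candidates, axis=2)                # [batch, length, B, d_model]
        weights = tf.nn.softmax(tf.concat(scores, -1), -1)   # soft block-size selection
        mixed = tf.reduce_sum(blocks * weights[..., None], axis=2)
        # Downsample the softly tokenized sequence before the Transformer stack.
        return tf.nn.avg_pool1d(mixed, ksize=self.downsample_rate,
                                strides=self.downsample_rate, padding="SAME")


# Example: a batch of 1024 byte embeddings with d_model=512 is reduced to
# length 512 with downsample_rate=2.
gbst = GBST(d_model=512, max_block_size=4, downsample_rate=2)
latent = gbst(tf.random.normal([2, 1024, 512]))  # -> shape [2, 512, 512]
```

With the downsampling rates of {2, 3, 4} quoted in the Experiment Setup row, the Transformer stack sees a sequence two to four times shorter than the raw byte input, which is where the speed advantage over plain byte-level models comes from.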
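The Experiment Setup row quotes an Adafactor optimizer with an inverse square root learning rate for pre-training and a constant 10^-3 rate for per-task fine-tuning. The sketch below shows what such schedules typically look like; the 10,000-step warmup floor is the usual T5 default and is an assumption, since the reviewed text does not quote a warmup value.

```python
import tensorflow as tf


def pretrain_lr(step, warmup_steps=10_000):
    """Inverse square-root schedule used with Adafactor during pre-training.

    The 10k-step warmup floor is the common T5 default and is an assumption;
    the reviewed text does not quote a warmup value.
    """
    step = tf.cast(step, tf.float32)
    return tf.math.rsqrt(tf.maximum(step, float(warmup_steps)))


def finetune_lr(step):
    """Constant learning rate of 1e-3 quoted for per-task fine-tuning."""
    return 1e-3
```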