Language Modelling with Pixels
Authors: Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. |
| Researcher Affiliation | Academia | Phillip Rust (1), Jonas F. Lotz (1,2), Emanuele Bugliarello (1), Elizabeth Salesky (3), Miryam de Lhoneux (5), Desmond Elliott (1,6); (1) University of Copenhagen, (2) ROCKWOOL Foundation Research Unit, (3) Johns Hopkins University, (5) KU Leuven, (6) Pioneer Centre for AI; p.rust@di.ku.dk |
| Pseudocode | Yes | Algorithm 1 (PIXEL Span Masking). Input: number of image patches N, masking ratio R, maximum masked span length S, span length cumulative weights W = {w1, ..., wS}. Output: masked patches M. M ← ∅; repeat: s ← randchoice({1, ..., S}, W); l ← randint(0, max(0, N − s)); r ← l + s; if M ∩ {l − s, ..., l − 1} = ∅ and M ∩ {r + 1, ..., r + s} = ∅ then M ← M ∪ {l, ..., r}; until \|M\| > R·N; return M. (A Python sketch of this procedure is given below the table.) |
| Open Source Code | Yes | We make the implementation, the pretrained model including intermediate training checkpoints, and the fine-tuned models freely available for the community. https://github.com/xplip/pixel |
| Open Datasets | Yes | PIXEL-base is pretrained on a rendered version of the English Wikipedia and the Bookcorpus (Zhu et al., 2015)... We evaluate PIXEL on part-of-speech (POS) tagging and dependency parsing using data from Universal Dependencies v2.10 treebanks (Nivre et al., 2020; Zeman et al., 2022)... We evaluate both monolingual (ENG) and cross-lingual word-level understanding on MasakhaNER (Adelani et al., 2021)... For monolingual ENG sentence-level understanding we rely on GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016)... Finally, we evaluate cross-lingual sentence-level understanding on TyDiQA-GoldP (Clark et al., 2020)... and on two additional larger monolingual extractive question answering (QA) corpora: KorQuAD 1.0 (KOR; Lim et al., 2019) and JaQuAD (JPN; So et al., 2022). Table 9: Links and references to the datasets we used in our finetuning experiments. |
| Dataset Splits | Yes | For a), we follow Rust et al. (2021) and process the training and validation splits of all available UD v2.10 treebanks in various languages with the PIXEL renderer and the tokenizers of BERT and MBERT. and Table 3: Results for PIXEL and BERT finetuned on GLUE. We report validation set performance averaged over 5 runs. |
| Hardware Specification | Yes | Pretraining took 8 days on 8 40GB Nvidia A100 GPUs. and The computing power was generously supported by EuroHPC grants 2010PA5869, 2021D02-068, and 2021D05-141, and with Cloud TPUs from Google's TPU Research Cloud (TRC). and We estimate throughput, measured in examples per second, by how long it takes to process 1M lines of English (ENG) and Chinese (ZHO) Wikipedia text on the same desktop workstation (AMD Ryzen 9 3900X 12-core CPU). |
| Software Dependencies | No | PIXEL is implemented in PyTorch (Paszke et al., 2019) and built on Hugging Face transformers (Wolf et al., 2020). We experimented with different text rendering backends. Following Salesky et al. (2021), our first implementation was based on PyGame, which PIXEL was also pretrained with. Later on, we switched to a backend based on Pango (Taylor, 2004) and Cairo. Specific version numbers for PyTorch or Hugging Face transformers are not provided. |
| Experiment Setup | Yes | Table 7 (PIXEL pretraining settings) provides extensive detail on parameters and training configuration. Table 11 (finetuning settings for POS tagging, dependency parsing (DP), NER, QA, and XNLI) and Table 12 (finetuning settings for GLUE tasks) specify hyperparameters such as learning rate, batch size, and maximum sequence length for finetuning. |
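
The span masking pseudocode quoted in the Pseudocode row can be expressed as a short Python sketch. This is an illustrative reimplementation of Algorithm 1 as quoted above, not the authors' released code; the function name and the uniform default over span lengths (used when no cumulative weights W are supplied) are assumptions.

```python
import random

def pixel_span_mask(num_patches, masking_ratio, max_span_length, cum_weights=None):
    """Sketch of PIXEL's span masking (Algorithm 1 as quoted above).

    num_patches:     N, number of image patches
    masking_ratio:   R, target fraction of patches to mask
    max_span_length: S, maximum length of a masked span
    cum_weights:     W, cumulative weights over span lengths 1..S
                     (assumed uniform here if not given)
    """
    masked = set()  # M, the set of masked patch indices
    span_lengths = list(range(1, max_span_length + 1))

    # repeat ... until |M| > R * N
    while len(masked) <= masking_ratio * num_patches:
        # s <- randchoice({1, ..., S}, W)
        s = random.choices(span_lengths, cum_weights=cum_weights, k=1)[0]
        # l <- randint(0, max(0, N - s)); r <- l + s
        l = random.randint(0, max(0, num_patches - s))
        r = l + s
        # Accept the span only if it is not adjacent to an already masked span:
        # M ∩ {l-s, ..., l-1} = ∅ and M ∩ {r+1, ..., r+s} = ∅
        left_context = set(range(l - s, l))
        right_context = set(range(r + 1, r + s + 1))
        if not (masked & left_context) and not (masked & right_context):
            # M <- M ∪ {l, ..., r}
            masked |= set(range(l, r + 1))
    return masked
```

As a usage illustration (the argument values here are arbitrary, not the paper's settings), `pixel_span_mask(num_patches=529, masking_ratio=0.25, max_span_length=6)` returns a set of patch indices covering just over a quarter of the patches, grouped into non-adjacent spans.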