Script, Language, and Labels: Overcoming Three Discrepancies for Low-Resource Language Specialization
Authors: Jaeseong Lee, Dohyeon Lee, Seung-won Hwang
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments over four different language families and three tasks show that our method improves the task performance of unseen languages with statistical significance, while the previous approach fails to do so. and In this section (Section 3, Experiments), we describe experimental settings and conduct experiments to answer the following research questions. |
| Researcher Affiliation | Academia | Jaeseong Lee, Dohyeon Lee and Seung-won Hwang* Computer Science and Engineering, Seoul National University {tbvj5914,waylight3,seungwonh}@snu.ac.kr |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and datasets we used are available. (Footnote 3: https://github.com/thnkinbtfly/SL2) |
| Open Datasets | Yes | For NER, we utilize WikiAnn (Pan et al. 2017) with a balanced split (Rahimi, Li, and Cohn 2019). We use Universal Dependencies (Nivre et al. 2020) version 2.5 (Zeman et al. 2019) for POS and DEP. and To be consistent with previous works (Chau and Smith 2021; Chau, Lin, and Smith 2020), we perform adaptive pretraining with Wikipedia articles extracted by WIKIEXTRACTOR, using only 80% of them. Our split is provided with our code. |
| Dataset Splits | Yes | When there is only a test dataset available for our target language, we perform an 8-fold cross-validation with an isolated fold for the validation set, following Muller et al. (2021a). and Finetuning is performed up to 200 epochs, with early stopping based on validation performance. (A hedged sketch of this cross-validation split follows the table.) |
| Hardware Specification | Yes | Adaptive pretraining is performed on TPUv2-8. |
| Software Dependencies | No | The paper mentions using 'AllenNLP', 'camel-tools', 'TRANSLITERATE', and 'FAST ALIGN' and provides citations for them, but it does not specify explicit version numbers for these software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | Then, we perform specialization via adaptive pretraining for 20 epochs, with a batch size of 16, a learning rate of 2e-5, and a warmup of 1000 steps, using only the masked language modeling (MLM) loss, following Chau and Smith (2021). [...] Finetuning is performed up to 200 epochs, with early stopping based on validation performance. and We select K as the last four layers and sim as the l2-norm, generate word alignments a using FAST ALIGN (Dyer, Chahuneau, and Smith 2013), and perform alignment for 1 epoch, following a previous successful cross-lingual alignment method (Kulshreshtha, Redondo Garcia, and Chang 2020). The number of generated word alignments is given in Table 1. We consume 8 sentences per batch. (Hedged sketches of the adaptive-pretraining configuration and the FAST ALIGN input/output handling follow the table.) |
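
To make the split protocol quoted in the Dataset Splits row concrete, below is a minimal sketch of 8-fold cross-validation with an isolated validation fold, assuming scikit-learn and a toy array of examples. The exact pairing of validation and test folds is not specified in the quoted text, so the rotation used here is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for the target-language data; in the paper this would
# be the lone annotated dataset available for the target language.
examples = np.arange(800)

kf = KFold(n_splits=8, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in kf.split(examples)]

for i, test_idx in enumerate(folds):
    # One fold is the test set, the next fold is isolated as the validation
    # set (used for early stopping), and the remaining six folds are training
    # data; scores would be averaged over the 8 rotations. The "next fold as
    # validation" choice is an assumption, not stated in the paper excerpt.
    val_idx = folds[(i + 1) % 8]
    train_idx = np.concatenate(
        [f for j, f in enumerate(folds) if j not in (i, (i + 1) % 8)]
    )
    # train on train_idx, early-stop on val_idx, report on test_idx
```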
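The adaptive-pretraining hyperparameters quoted in the Experiment Setup row (20 epochs, batch size 16, learning rate 2e-5, 1000 warmup steps, MLM-only loss) can be expressed as a HuggingFace Transformers training configuration. This is a sketch under assumptions, not the authors' script: the choice of bert-base-multilingual-cased as the base model, the corpus file path, the 512-token truncation, and the 0.15 masking probability are all assumptions not given in the quoted text.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "bert-base-multilingual-cased"  # assumed base model
CORPUS_FILE = "wiki_target_lang.txt"         # hypothetical target-language Wikipedia text

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

raw = load_dataset("text", data_files={"train": CORPUS_FILE})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# MLM-only objective, as reported in the paper; masking probability is assumed.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="adaptive-pretrain",
    num_train_epochs=20,             # 20 epochs, as reported
    per_device_train_batch_size=16,  # batch size 16, as reported
    learning_rate=2e-5,              # learning rate 2e-5, as reported
    warmup_steps=1000,               # warmup of 1000 steps, as reported
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=collator,
).train()
```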
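The word-alignment step relies on FAST ALIGN, whose input and output conventions are documented with the tool but not shown in the paper excerpt. The sketch below assumes the fast_align binary is on PATH and uses hypothetical sentence pairs; it only illustrates how parallel text is formatted for the tool and how its Pharaoh-format output can be parsed, not how the paper consumes the alignments downstream.

```python
import subprocess

# fast_align expects one parallel sentence pair per line, tokens separated by
# spaces and the two sides separated by " ||| ".
pairs = [("a low resource sentence", "its english translation")]  # hypothetical data
with open("parallel.txt", "w", encoding="utf-8") as f:
    for src, tgt in pairs:
        f.write(f"{src} ||| {tgt}\n")

# -d (favor the diagonal), -o (optimize tension), -v (Dirichlet prior) are the
# settings recommended in the fast_align README.
result = subprocess.run(
    ["fast_align", "-i", "parallel.txt", "-d", "-o", "-v"],
    capture_output=True, text=True, check=True,
)

# Each stdout line is in Pharaoh format, e.g. "0-0 1-2", giving
# source-target token index pairs for one sentence pair.
alignments = [
    [tuple(map(int, link.split("-"))) for link in line.split()]
    for line in result.stdout.strip().splitlines()
]
print(alignments)
```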