Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching
Authors: Zhuoran Li, Chunming Hu, Junfan Chen, Zhijun Chen, Xiaohui Guo, Richong Zhang
IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages. |
| Researcher Affiliation | Academia | ¹SKLSDE, School of Computer Science and Engineering, Beihang University, Beijing, China; ²School of Software, Beihang University, Beijing, China; ³Hangzhou Innovation Institute, Beihang University, Hangzhou, China; {lizhuoranget, hucm, zhijunchen}@buaa.edu.cn, {chenjf, guoxh, zhangrc}@act.buaa.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or a link to the open-source code for the methodology described. |
| Open Datasets | Yes | To comprehensively evaluate our proposed method, we conduct experiments on three types of cross-lingual transfer tasks with three widely used datasets. (1) For paraphrase identification, we employ PAWS-X dataset [Yang et al., 2019] containing seven languages. ... (2) For document classification, we employ MLDoc [Schwenk and Li, 2018] as our document classification dataset... (3) For spoken language understanding, we use the cross-lingual task-oriented dialogue dataset (XTOD) [Schuster et al., 2019]... |
| Dataset Splits | Yes | Table 1 (summary statistics of datasets): PAWS-X: 7 languages, 49,401 train / 2,000 dev / 2,000 test, 2 labels, Acc.; MLDoc: 8 languages, 10,000 train / 1,000 dev / 2,000 test, 4 labels, Acc.; XTOD: 3 languages, 30,521 train / 4,181 dev / 2,368 test, 12/11 labels, Acc./F1 |
| Hardware Specification | Yes | All models are trained on a single Tesla V100 32GB GPU. |
| Software Dependencies | No | The paper mentions the Hugging Face Transformers library (used for the backbone model) and the AdamW optimizer, but does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | We set the batch size to 16 or 64, the maximum sequence length to 128, and the dropout rate to 0.1, and we use AdamW as the optimizer. We select the best learning rate from {5e-6, 1e-5} for the encoder and {1e-3, 1e-5} for the task-specific network layer. As for the scheduler, we initialize τ = 0, which increases linearly as the stage increases. |
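Since the paper does not release code (see the Open Source Code row above), the sketch below is only an illustration of how the reported experiment setup (AdamW, separate learning rates for the encoder and the task-specific head, and a code-switching ratio τ that starts at 0 and grows linearly with the training stage) might be wired together in PyTorch with Hugging Face Transformers. The backbone name, the number of stages, and the `code_switch_ratio` helper are assumptions, not the authors' implementation.

```python
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

MODEL_NAME = "xlm-roberta-base"  # hypothetical multilingual backbone; not specified in this table
NUM_STAGES = 4                   # hypothetical; the paper only states that tau increases linearly per stage

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Two parameter groups with separate learning rates, matching the reported search grids:
# encoder lr chosen from {5e-6, 1e-5}, task-specific head lr chosen from {1e-3, 1e-5}.
optimizer = AdamW(
    [
        {"params": model.base_model.parameters(), "lr": 1e-5},
        {"params": model.classifier.parameters(), "lr": 1e-3},
    ]
)

def code_switch_ratio(stage: int, num_stages: int = NUM_STAGES) -> float:
    """Code-switching ratio tau: initialized at 0 and increased linearly as the stage advances."""
    return stage / max(num_stages - 1, 1)

# Other hyperparameters reported in the paper.
BATCH_SIZE = 16       # 16 or 64, depending on the task
MAX_SEQ_LENGTH = 128
DROPOUT_RATE = 0.1
```

In practice the two learning rates would be selected per task by grid search over the values quoted above, and training would loop over stages, recomputing `code_switch_ratio(stage)` to control how aggressively source-language tokens are code-switched at each stage.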