The Reasonableness Behind Unreasonable Translation Capability of Large Language Model

Authors: Tingchen Fu, Lemao Liu, Deng Cai, Guoping Huang, Shuming Shi, Rui Yan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we have made significant findings. To quantify the effect of the three types of bilingualism, it is too costly to train an LLM from scratch, so we develop two computationally feasible methods as surrogates to measure the impact of data (§4). Subsequently, we apply the two surrogate methods to the BLOOM-family model (Scao et al., 2022) with our collected three types of unintentional bilingual data and compare their effects on translation capacity (§5). Moreover, extensive experiments are conducted to glean insights into the impact of other factors (e.g., monolingual data, parameter-sharing, data volume) on the acquisition of translation capacity for LLMs (§5).
Researcher Affiliation | Collaboration | Tingchen Fu, Lemao Liu, Deng Cai, Guoping Huang, Shuming Shi, Rui Yan; affiliations: Gaoling School of Artificial Intelligence, Renmin University of China, and Tencent AI Lab; contact: lucas.futingchen@gmail.com, redmondliu@tencent.com, ruiyan@ruc.edu.cn
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/TingchenFu/ICLR24-TransContamination.
Open Datasets | Yes | If not otherwise specified, we use the unintentional bilingual data and purified monolingual data (i.e., excluding SA, WA and CS) from mC4.en and mC4.zh to perform the experiments, and their statistics are shown in Table 3. We use the WMT21 news translation task (Akhbardeh et al., 2021) and FLORES-200 (Team et al., 2022) as our evaluation benchmarks, with their statistics presented in Table 2.
Dataset Splits | Yes | We use the WMT21 news translation task (Akhbardeh et al., 2021) and FLORES-200 (Team et al., 2022) as our evaluation benchmarks, with their statistics presented in Table 2. Table 2 (statistics of the evaluation benchmarks; numbers in brackets denote the number of instances): WMT21 English-Chinese: newstest2021 (1948/1002) for evaluation and newstest{2017,2018,2019} for development; FLORES-200 English: eng_Latn.devtest (1012) and eng_Latn.dev (997); FLORES-200 Chinese: zho_Hans.devtest (1012) and zho_Hans.dev (997). A hedged sketch for loading these benchmarks and the mC4 corpora appears after the table.
Hardware Specification | Yes | Our experiments are conducted on a cloud Linux server running the Ubuntu 16.04 operating system. The code is written in Python 3.10 using the Hugging Face library. The GPU is an Nvidia Tesla V100 with 32 GB of GPU memory.
Software Dependencies | No | The code is written in Python 3.10 using the Hugging Face library. We use fasttext as our language detection tool. We use sacrebleu to measure similarity. The paper mentions Python 3.10 but does not provide version numbers for the other libraries and tools it relies on, such as the Hugging Face library, fasttext, or sacrebleu. A hedged usage sketch for these two tools appears after the table.
Experiment Setup | Yes | The detailed hyper-parameter settings for post-training and from-scratch training are shown in Table 16. Table 16 (hyper-parameters for post-training and pre-training): precision float16, batch size 256, optimizer AdamW, learning rate 1e-5, sequence length 1024, warmup steps 0, decay style cosine, minimum learning rate 0, weight decay 1e-1, gradient clip 1.0, LoRA rank 8, LoRA α 16. A hedged training-configuration sketch using these values appears after the table.
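
Data-loading sketch (referenced from the Open Datasets and Dataset Splits rows). This is a minimal sketch of how the FLORES-200 splits and the mC4 monolingual corpora could be fetched with the Hugging Face datasets library. The Hub identifiers ("facebook/flores", "mc4"), the per-language config names, and the use of streaming are assumptions not stated in the paper, and newer datasets releases may require trust_remote_code=True or different identifiers.

```python
# Hedged sketch: fetching the evaluation benchmarks and mC4 corpora with the
# Hugging Face `datasets` library. Dataset identifiers and configs are
# assumptions; the paper names the corpora but not the loading code.
from datasets import load_dataset

# FLORES-200 dev (997) / devtest (1012) splits for English and Chinese.
flores_en = load_dataset("facebook/flores", "eng_Latn")
flores_zh = load_dataset("facebook/flores", "zho_Hans")
print({split: len(flores_en[split]) for split in ("dev", "devtest")})  # expected: 997 / 1012
print({split: len(flores_zh[split]) for split in ("dev", "devtest")})

# mC4 English/Chinese monolingual data, streamed so the full corpus is not
# downloaded. The paper's SA/WA/CS extraction and purification steps are not
# reproduced here.
mc4_en = load_dataset("mc4", "en", split="train", streaming=True)
mc4_zh = load_dataset("mc4", "zh", split="train", streaming=True)
for document in mc4_en.take(3):
    print(document["text"][:80])

# The WMT21 newstest sets are typically obtained via the sacrebleu CLI, e.g.:
#   sacrebleu -t wmt21 -l en-zh --echo src
```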
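Tooling sketch (referenced from the Software Dependencies row). The paper names fasttext for language detection and sacrebleu for similarity measurement but reports no versions or calls. The snippet below is an assumed minimal usage with the public lid.176.bin language-identification model and sentence-level BLEU, not the authors' actual pipeline.

```python
# Hedged sketch of the two tools named in the paper: fasttext for language
# identification and sacrebleu for similarity scoring. The lid.176.bin model
# file and the example sentences are assumptions, not values from the paper.
import fasttext
import sacrebleu

# Off-the-shelf fastText LID model, downloadable from
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
lid_model = fasttext.load_model("lid.176.bin")
labels, probs = lid_model.predict("今天的天气很好。", k=1)
print(labels[0], probs[0])  # e.g. "__label__zh" with its confidence

# Sentence-level similarity between a candidate and a reference, computed
# here as sentence BLEU via sacrebleu.
hypothesis = "The weather is very nice today."
reference = "The weather today is very nice."
score = sacrebleu.sentence_bleu(hypothesis, [reference])
print(score.score)
```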
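Training-configuration sketch (referenced from the Experiment Setup row). The snippet below shows one way the Table 16 values could be wired into a Hugging Face Trainer with a LoRA adapter. The BLOOM checkpoint (bigscience/bloom-560m), the LoRA target modules, and the per-device batch size / gradient accumulation split are assumptions; the paper lists only the hyper-parameter values, not this script.

```python
# Hedged sketch: assembling the Table 16 hyper-parameters (float16, AdamW,
# lr 1e-5, cosine decay to 0, no warmup, weight decay 0.1, grad clip 1.0,
# LoRA rank 8 / alpha 16, effective batch size 256) for a BLOOM-family model.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

model_name = "bigscience/bloom-560m"  # placeholder BLOOM-family checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA rank 8, alpha 16 as in Table 16; target modules are an assumption
# (query_key_value is the usual attention projection name in BLOOM).
lora_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM",
                         target_modules=["query_key_value"])
model = get_peft_model(model, lora_config)

# Sequence length 1024 would be enforced at tokenization time; the effective
# batch size of 256 is reached here as per-device batch size x accumulation.
args = TrainingArguments(
    output_dir="post_training_ckpt",
    fp16=True,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",   # decays to a minimum learning rate of 0
    warmup_steps=0,
    weight_decay=0.1,
    max_grad_norm=1.0,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=32,  # 8 x 32 = 256 sequences per update
    optim="adamw_torch",
    num_train_epochs=1,
)
trainer = Trainer(model=model, args=args, train_dataset=None)  # supply a tokenized dataset
# trainer.train()
```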