Teaching Large Language Models to Translate with Comparison
Authors: Jiali Zeng, Fandong Meng, Yongjing Yin, Jie Zhou
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation on four language directions of WMT2022 and FLORES-200 benchmarks shows the superiority of our proposed method over existing methods. We evaluate our proposed method on WMT22 and FLORES-200 test sets (EN⇔DE, EN⇔ZH), and the improvement over the baselines shows the effectiveness of our method. |
| Researcher Affiliation | Industry | Pattern Recognition Center, WeChat AI, Tencent Inc {lemonzeng,fandongmeng,yongjingyin,withtomzhou}@tencent.com |
| Pseudocode | No | The paper describes its method and calculations in narrative text and mathematical formulas, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please refer to Github for more details: https://github.com/lemon0830/TIM. |
| Open Datasets | Yes | To avoid data leakage (Garcia et al. 2023), we use the latest WMT22 test set and FLORES-200 dev-test: 1) We use the test sets from the WMT22 competition, which consist of more recent content from diverse domains such as news, social, e-commerce, and conversational domains. The test sets comprise 1984, 2037, 1875, and 2037 samples for the German-to-English (De→En), English-to-German (En→De), Chinese-to-English (Zh→En), and English-to-Chinese (En→Zh) language pairs, respectively. 2) We use the dev-test split from the FLORES-200 benchmarks. This dataset includes 1,012 sentences extracted from English Wikipedia, covering a broad range of topics and domains. Professional translators have carefully translated these sentences into approximately 200 languages. The training data for TIM-(*) consists of the alpaca dataset, the WMT translation data, the Dictionary-guided data, Order-guided data constructed from the WMT validation data, and Error-guided data constructed from MQM data. (A hedged sketch for loading the FLORES-200 dev-test split follows the table.) |
| Dataset Splits | Yes | 2) We use the dev-test split from the FLORES-200 benchmarks. This dataset includes 1,012 sentences extracted from English Wikipedia, covering a broad range of topics and domains. Professional translators have carefully translated these sentences into approximately 200 languages. The training data for TIM-(*) consists of the alpaca dataset, the WMT translation data, the Dictionary-guided data, Order-guided data constructed from the WMT validation data, and Error-guided data constructed from MQM data. |
| Hardware Specification | Yes | We conducted fine-tuning on eight NVIDIA A100 GPUs, utilizing DeepSpeed ZeRO stage 3 for model parallelism. |
| Software Dependencies | No | The paper mentions 'DeepSpeed ZeRO stage 3' as a tool for model parallelism but does not specify its version number, nor does it list version numbers for other software components such as programming languages, frameworks (e.g., PyTorch, TensorFlow), or specific libraries. |
| Experiment Setup | Yes | We fine-tuned all models for 1 epoch with a batch size of 128, while imposing a maximum text length of 512. The learning rates are 2e-5 for FixEmb and Full, and 3e-4 for LoRA, respectively. The weight decay parameter is set to 0.0. (A hedged training-configuration sketch follows the table.) |
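
The Open Datasets row quotes the paper's use of the FLORES-200 dev-test split. The paper does not give a loading procedure, so the sketch below is only an assumption: it uses the Hugging Face `datasets` library and the public `facebook/flores` configuration and field names, neither of which is mentioned in the paper or the TIM repository.

```python
# Minimal sketch (assumption, not the authors' pipeline): load the FLORES-200
# dev-test split with the Hugging Face `datasets` library. Recent `datasets`
# releases may require trust_remote_code=True because FLORES-200 ships a
# loading script.
from datasets import load_dataset

# Language-pair configs follow FLORES-200 codes, e.g. "eng_Latn-deu_Latn".
flores = load_dataset("facebook/flores", "eng_Latn-deu_Latn",
                      trust_remote_code=True)

devtest = flores["devtest"]             # the 1,012-sentence dev-test split
print(len(devtest))                     # expected: 1012
example = devtest[0]
print(example["sentence_eng_Latn"])     # English source sentence
print(example["sentence_deu_Latn"])     # German reference translation
```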
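
The Hardware Specification and Experiment Setup rows together describe the training configuration: eight A100 GPUs with DeepSpeed ZeRO stage 3, one epoch, a global batch size of 128, a maximum text length of 512, learning rate 2e-5 (FixEmb/Full) or 3e-4 (LoRA), and weight decay 0.0. The sketch below maps these reported values onto Hugging Face `transformers` `TrainingArguments`; the per-device batch size, gradient-accumulation split, precision, and optimizer defaults are assumptions not stated in the paper.

```python
# Minimal sketch (assumptions, not the authors' released configuration):
# expressing the reported hyperparameters as TrainingArguments with a
# DeepSpeed ZeRO-3 config dict.
from transformers import TrainingArguments

deepspeed_zero3 = {
    "zero_optimization": {"stage": 3},       # ZeRO stage 3 model parallelism
    "bf16": {"enabled": True},               # assumed precision; not stated in the paper
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="tim-finetune",
    num_train_epochs=1,                      # "fine-tuned all models for 1 epoch"
    per_device_train_batch_size=8,           # 8 GPUs x 8 x 2 accumulation = 128 global batch
    gradient_accumulation_steps=2,
    learning_rate=2e-5,                      # 2e-5 for FixEmb/Full; use 3e-4 for LoRA
    weight_decay=0.0,
    deepspeed=deepspeed_zero3,
)

# The 512-token maximum text length would be enforced at tokenization time,
# e.g. tokenizer(text, max_length=512, truncation=True).
```

The decomposition of the global batch of 128 into per-device batch 8 with 2 accumulation steps on eight GPUs is only one illustrative choice; per-device batch 16 with no accumulation would reach the same global size.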