Prompting Large Language Model for Machine Translation: A Case Study

Authors: Biao Zhang, Barry Haddow, Alexandra Birch

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with GLM-130B (Zeng et al., 2022) as the testbed show that 1) the number and the quality of prompt examples matter, where using suboptimal examples degenerates translation; 2) several features of prompt examples, such as semantic similarity, show significant Spearman correlation with their prompting performance; yet, none of the correlations are strong enough; 3) using pseudo parallel prompt examples constructed from monolingual data via zero-shot prompting could improve translation; and 4) improved performance is achievable by transferring knowledge from prompt examples selected in other settings.
Researcher Affiliation | Collaboration | *Now at Google DeepMind; work done prior to joining Google. School of Informatics, University of Edinburgh. Correspondence to: Biao Zhang <biaojiaxing@google.com>, Barry Haddow <bhaddow@inf.ed.ac.uk>, Alexandra Birch <a.birch@ed.ac.uk>.
Pseudocode | No | The paper describes methods in prose and does not include any explicitly labeled “Pseudocode” or “Algorithm” blocks or structured code-like procedures.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We perform major analysis on FLORES (Wiki domain, En-De-Zh, NLLB Team et al., 2022) and WMT21 (News domain, En-De, En-Zh, Akhbardeh et al., 2021), and also report results on Multi-Domain (IT, Law and Medical domain, De-En, Aharoni & Goldberg, 2020) to examine domain robustness and transfer ability, and PDC (News domain, Zh-En, Sun et al., 2022) for document-level translation.
Dataset Splits | Yes | To understand the relation between prompt examples and their prompting performance, we construct an Ablation set for Wiki, WMT and Multi-Domain (IT and Medical) based on the dev set of FLORES, WMT21 and Multi-Domain, respectively, where we randomly sample 100 instances as the ablation test set and use the rest as the default example selection pool. To distinguish, we will refer to the official dev and test set as Full set. Detailed statistics are listed in Table 1.
Hardware Specification | Yes | We adopt beam search for MT with a beam size of 2, and perform experiments with 4 RTX 3090 or A100-40G GPUs.
Software Dependencies | Yes | We evaluate translation performance using both a surface-based metric, detokenized case-sensitive BLEU from SacreBLEU (Post, 2018) (with the option -tok zh for Chinese), and a model-based metric, COMET from unbabel-comet with the model wmt20-comet-da (Rei et al., 2020).
Experiment Setup | Yes | We adopt beam search for MT with a beam size of 2, and perform experiments with 4 RTX 3090 or A100-40G GPUs.