Prompting Large Language Model for Machine Translation: A Case Study

Authors: Biao Zhang, Barry Haddow, Alexandra Birch

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments with GLM-130B (Zeng et al., 2022) as the testbed show that 1) the number and the quality of prompt examples matter, where using suboptimal examples degenerates translation; 2) several features of prompt examples, such as semantic similarity, show significant Spearman correlation with their prompting performance; yet, none of the correlations are strong enough; 3) using pseudo parallel prompt examples constructed from monolingual data via zero-shot prompting could improve translation; and 4) improved performance is achievable by transferring knowledge from prompt examples selected in other settings.
Researcher Affiliation | Collaboration | *Now at Google DeepMind; work done prior to joining Google. School of Informatics, University of Edinburgh. Correspondence to: Biao Zhang <biaojiaxing@google.com>, Barry Haddow <bhaddow@inf.ed.ac.uk>, Alexandra Birch <a.birch@ed.ac.uk>.
Pseudocode | No | The paper describes methods in prose and does not include any explicitly labeled “Pseudocode” or “Algorithm” blocks or structured code-like procedures.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We perform major analysis on FLORES (Wiki domain, En-De-Zh, NLLB Team et al., 2022) and WMT21 (News domain, En-De, En-Zh, Akhbardeh et al., 2021), and also report results on Multi-Domain (IT, Law and Medical domain, De-En, Aharoni & Goldberg, 2020) to examine domain robustness and transfer ability, and PDC (News domain, Zh-En, Sun et al., 2022) for document-level translation.
Dataset Splits | Yes | To understand the relation between prompt examples and their prompting performance, we construct an Ablation set for Wiki, WMT and Multi-Domain (IT and Medical) based on the dev set of FLORES, WMT21 and Multi-Domain, respectively, where we randomly sample 100 instances as the ablation test set and use the rest as the default example selection pool. To distinguish, we will refer to the official dev and test set as Full set. Detailed statistics are listed in Table 1.
Hardware Specification | Yes | We adopt beam search for MT with a beam size of 2, and perform experiments with 4 RTX 3090 or A100-40G GPUs.
Software Dependencies | Yes | We evaluate translation performance using both a surface-based metric, detokenized case-sensitive BLEU from SacreBLEU (Post, 2018) (with the option -tok zh for Chinese), and a model-based metric, COMET from unbabel-comet with the model wmt20-comet-da (Rei et al., 2020).
Experiment Setup | Yes | We adopt beam search for MT with a beam size of 2, and perform experiments with 4 RTX 3090 or A100-40G GPUs.