Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Calibrating Translation Decoding with Quality Estimation on LLMs

Authors: Di Wu, Yibin Lei, Christof Monz

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate effectiveness over recent preference optimization methods, such as CPO [Xu et al., 2024], for translation. ... Table 1 presents the results for the Tower series under an off-policy setting... Figure 2 depicts the Spearman score... We use the WMT22 metric meta-evaluation dataset...
Researcher Affiliation	Academia	Di Wu Yibin Lei Christof Monz University of Amsterdam EMAIL
Pseudocode	No	The paper describes the methodology in narrative text and mathematical formulations but does not present any dedicated pseudocode or algorithm blocks.
Open Source Code	Yes	The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released: https://github.com/moore3930/calibrating-llm-mt.
Open Datasets	Yes	For the training set, we merge all English sentences from the Flores-200 dataset [Costa-Jussà et al., 2022] in dev and devtest splits... For fair comparison with the ALMA and Tower model series, we evaluate translation performance on the WMT22 and WMT24 datasets [Zerva et al., 2022, Kocmi et al., 2024a]... We use the MAPLE dataset [Zhu et al., 2024], which includes four translation directions, as the calibration set.
Dataset Splits	Yes	For the training set, we merge all English sentences from the Flores-200 dataset [Costa-Jussà et al., 2022] in dev and devtest splits, and use them as the source, consisting of 2,009 samples. ... All testsets in WMT24 are paragraph-level and share the same English parts, consisting of 960 samples for each direction. WMT22 consists of 22 language directions... Each direction contains 2037 sentence pairs. ... NTREX [Federmann et al., 2022], which contains 1,997 samples per direction. ... We randomly sample 200 translation outputs in the directions of en→zh/ru for both the baseline model (TowerInstruct-Mistral-7B) and our calibrated model from the WMT24 dataset.
Hardware Specification	Yes	All experiments use H100 GPUs, with 7B models trained on one GPU and 13B models trained on two GPUs.
Software Dependencies	Yes	The corresponding metric model versions are Unbabel/wmt22-comet-da, Unbabel/XCOMET-XXL, Unbabel/wmt23-cometkiwi-da-xl, and Unbabel/wmt23-cometkiwi-da-xxl, respectively.
Experiment Setup	Yes	For all experiments, we train models using LoRA [Hu et al., 2022] with rank 8, setting α to 32 and dropout to 0.05. Training uses a batch size of 32, gradient accumulation of 8 steps, and sequences capped at 512 tokens. We train each model for 3 epochs, selecting checkpoints based on the best validation performance measured by XCOMET on NTREX [Federmann et al., 2022], which contains 1,997 samples per direction. To ensure robust results, we experiment with learning rates ranging from 1e-5 to 1e-4, reporting the best results for all settings. Adam [Kingma and Ba, 2014] is used as the optimizer. Unless otherwise specified (e.g., 6.1), we use CometKiwi-XXL as signal during training and report results in XCOMET, COMET, and CometKiwi-XL.
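
For readers attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The following is a minimal sketch, not the authors' configuration file: the class name `LoraSetup` and all field names are hypothetical, and only the numeric values come from the quoted text. The `lora_scaling` property assumes the standard LoRA convention of scaling the low-rank update by α / rank; the paper excerpt does not state this explicitly. The excerpt gives only the endpoints of the learning-rate range, so no intermediate grid values are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSetup:
    """Values quoted in the Experiment Setup row; names are illustrative."""

    rank: int = 8              # LoRA rank
    alpha: int = 32            # LoRA alpha
    dropout: float = 0.05      # LoRA dropout
    batch_size: int = 32       # batch size as quoted
    grad_accum_steps: int = 8  # gradient accumulation steps
    max_seq_len: int = 512     # sequences capped at 512 tokens
    epochs: int = 3            # training epochs per model
    lr_min: float = 1e-5       # lower end of the searched learning-rate range
    lr_max: float = 1e-4       # upper end of the searched learning-rate range

    @property
    def lora_scaling(self) -> float:
        # Assumption: standard LoRA scaling factor, alpha divided by rank.
        return self.alpha / self.rank

    def lr_in_range(self, lr: float) -> bool:
        # True when a candidate learning rate lies inside the quoted range.
        return self.lr_min <= lr <= self.lr_max


setup = LoraSetup()
print(setup.lora_scaling)  # 4.0
```

A candidate learning rate can be checked against the quoted range with `setup.lr_in_range(5e-5)`. The optimizer (Adam) and the checkpoint selection criterion (XCOMET on NTREX) are left out of the sketch because they depend on the training framework, which the quoted text does not name.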