MT-Ranker: Reference-free machine translation evaluation by inter-system ranking

Authors: Ibraheem Muhammad Moosa, Rui Zhang, Wenpeng Yin

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTS. 4.1 EXPERIMENTAL SETUP. 4.2 RESULTS. Table 1: Segment-level Kendall's Tau correlations on language pairs from the WMT20 Shared Metrics Task dataset.
Researcher Affiliation | Academia | Ibraheem Muhammad Moosa, Rui Zhang & Wenpeng Yin, Department of Computer Science and Engineering, Pennsylvania State University, {ibraheem.moosa,rmz5227,wenpeng}@psu.edu
Pseudocode | No | The paper describes its methodology in Section 3 using textual descriptions and mathematical formulations (Equations 1-11), but it does not include any explicitly labeled pseudocode or algorithm blocks; a hedged sketch of the pairwise-ranking idea appears after this table.
Open Source Code | Yes | https://github.com/ibraheem-moosa/mt-ranker
Open Datasets | Yes | Our benchmark datasets are the WMT20 Shared Metrics Task dataset (DA20) (Mathur et al., 2020), the MQM20 (Freitag et al., 2021a), the MQM21 (Freitag et al., 2021b), the MQM22 (Freitag et al., 2022), and the ACES (Amrhein et al., 2022) datasets.
Dataset Splits | Yes | To tune hyperparameters, we construct a development set from the DA17, DA18, and DA19 datasets. We randomly take 50 source sentences per language pair from these datasets and use the relative-ranking samples corresponding to these source sentences as the development set. The Kendall's Tau correlation on this development set is used as the validation metric.
Hardware Specification | Yes | The models were trained on A100 GPUs. Each model was trained on a single GPU.
Software Dependencies | No | The paper mentions the use of the Huggingface library, XLM-RoBERTa, M2M100, and BERTScore, but it does not specify explicit version numbers for these software dependencies (e.g., 'Huggingface Transformers version X.Y.Z').
Experiment Setup | Yes | In Table 6 we show the hyperparameters used for training our models. Learning rate and batch size were tuned based on the validation metric. We use early stopping by evaluating the models every 1000 steps and choosing the checkpoint with the highest validation metric. Table 6 (hyperparameters): MT-Ranker-Base: batch size 128, learning rate 5e-5, 100k stage-1 steps, 20k stage-2 steps; MT-Ranker-Large: batch size 64, learning rate 5e-5, 100k stage-1 steps, 20k stage-2 steps; MT-Ranker-XXL: batch size 32, learning rate 1e-5, 20k stage-1 steps, 20k stage-2 steps.
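
Since the paper itself contains no pseudocode, the following is a minimal sketch of the inter-system ranking idea named in the title: a reference-free pairwise comparator decides, for a source sentence and two candidate translations, which candidate is better, and those decisions are aggregated into a system-level ranking. The comparator `prefers_first` is a hypothetical stand-in for the trained MT-Ranker model, and the aggregation shown (counting pairwise wins) is an assumption, not necessarily the authors' exact procedure.

```python
from itertools import combinations
from typing import Callable, Dict, List


def rank_systems(
    source_sents: List[str],
    system_outputs: Dict[str, List[str]],
    prefers_first: Callable[[str, str, str], bool],
) -> List[str]:
    """Aggregate pairwise preferences into a system-level ranking.

    `prefers_first(src, cand_a, cand_b)` is a hypothetical stand-in for
    the trained MT-Ranker comparator: it returns True if cand_a is the
    better translation of src. No reference translation is needed.
    Systems are ranked by their total number of pairwise wins.
    """
    wins = {name: 0 for name in system_outputs}
    for sys_a, sys_b in combinations(system_outputs, 2):
        for i, src in enumerate(source_sents):
            if prefers_first(src, system_outputs[sys_a][i], system_outputs[sys_b][i]):
                wins[sys_a] += 1
            else:
                wins[sys_b] += 1
    # Highest win count first.
    return sorted(wins, key=wins.get, reverse=True)
```
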
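The Dataset Splits row above uses segment-level Kendall's Tau on relative-ranking samples as the validation metric. Below is a sketch of the standard WMT-style pairwise formulation, tau = (concordant - discordant) / (concordant + discordant); the shared task's exact tie-handling conventions may differ.

```python
from typing import List, Tuple


def pairwise_kendall_tau(pairs: List[Tuple[bool, bool]]) -> float:
    """WMT-style Kendall's Tau over ranked translation pairs.

    Each element is (human_prefers_first, metric_prefers_first) for one
    pair of translations of the same source segment. A pair is
    concordant when the metric agrees with the human judgment.
    """
    concordant = sum(1 for human, metric in pairs if human == metric)
    discordant = len(pairs) - concordant
    if concordant + discordant == 0:
        return 0.0
    return (concordant - discordant) / (concordant + discordant)
```
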
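The Experiment Setup row describes early stopping: evaluate every 1000 steps and keep the checkpoint with the highest validation metric. A minimal sketch of that loop follows; `train_step` and `evaluate_dev_tau` are hypothetical placeholders for the actual training and evaluation routines.

```python
def train_with_early_stopping(train_step, evaluate_dev_tau,
                              total_steps: int = 100_000,
                              eval_every: int = 1_000):
    """Keep the checkpoint with the highest dev Kendall's Tau.

    `train_step(step)` performs one optimization step and
    `evaluate_dev_tau()` returns (tau, checkpoint_state); both are
    stand-ins for the paper's actual training and evaluation code.
    """
    best_tau, best_ckpt = float("-inf"), None
    for step in range(1, total_steps + 1):
        train_step(step)
        if step % eval_every == 0:
            tau, ckpt = evaluate_dev_tau()
            if tau > best_tau:
                best_tau, best_ckpt = tau, ckpt
    return best_tau, best_ckpt
```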