MT-Ranker: Reference-free machine translation evaluation by inter-system ranking
Authors: Ibraheem Muhammad Moosa, Rui Zhang, Wenpeng Yin
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS. 4.1 EXPERIMENTAL SETUP. 4.2 RESULTS. Table 1: Segment-level Kendall's Tau correlations on language pairs from the WMT20 Shared Metrics Task dataset. |
| Researcher Affiliation | Academia | Ibraheem Muhammad Moosa, Rui Zhang & Wenpeng Yin Department of Computer Science and Engineering Pennsylvania State University {ibraheem.moosa,rmz5227,wenpeng}@psu.edu |
| Pseudocode | No | The paper describes its methodology in Section 3 using textual descriptions and mathematical formulations (e.g., Equations 1-11), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/ibraheem-moosa/mt-ranker |
| Open Datasets | Yes | Our benchmark datasets are the WMT20 Shared Metrics Task dataset (DA20) (Mathur et al., 2020), the MQM20 (Freitag et al., 2021a), the MQM21 (Freitag et al., 2021b), the MQM22 (Freitag et al., 2022) and the ACES (Amrhein et al., 2022) datasets. |
| Dataset Splits | Yes | To tune hyperparameters we construct a development set from the DA17, DA18, and DA19 datasets. We randomly take 50 source sentences per language pair from these datasets and use the relative ranking samples corresponding to these source sentences as the development set. The Kendall's Tau correlation on this development set is used as the validation metric. (See the Kendall's Tau sketch below the table.) |
| Hardware Specification | Yes | The models were trained on A100 GPUs. Each model was trained on a single GPU. |
| Software Dependencies | No | The paper mentions the use of the Huggingface library, XLMRoberta, M2M100, and BERTScore, but it does not specify explicit version numbers for these software dependencies (e.g., 'Huggingface Transformers version X.Y.Z'). |
| Experiment Setup | Yes | In Table 6 we show the hyperparameters used for training our models. Learning rate and batch size were tuned based on the validation metric. We use early stopping by evaluating the models every 1000 steps and choosing the checkpoint with the highest validation metric. Table 6 (Hyperparameters used for training our models): MT-Ranker-Base: batch size 128, learning rate 5×10⁻⁵, 100k stage-1 training steps, 20k stage-2 training steps; MT-Ranker-Large: batch size 64, learning rate 5×10⁻⁵, 100k stage-1 steps, 20k stage-2 steps; MT-Ranker-XXL: batch size 32, learning rate 1×10⁻⁵, 20k stage-1 steps, 20k stage-2 steps. (A training-configuration sketch follows the table.) |
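
The Dataset Splits row above reports segment-level Kendall's Tau on the development set as the validation metric. Below is a minimal sketch of how that pairwise-agreement statistic is typically computed in WMT-style relative ranking; the function name and the data are hypothetical and not taken from the paper's released code.

```python
# Minimal sketch (not the authors' code): segment-level Kendall's Tau as used in
# WMT metrics tasks -- agreement between a metric's pairwise preferences and
# human relative-ranking judgments. All data below is hypothetical.

def kendalls_tau(human_prefs, metric_prefs):
    """tau = (concordant - discordant) / (concordant + discordant).

    Each element is +1 if translation A is preferred over B, -1 otherwise,
    for the same (source, A, B) triple in both lists.
    """
    assert len(human_prefs) == len(metric_prefs)
    concordant = sum(h == m for h, m in zip(human_prefs, metric_prefs))
    discordant = len(human_prefs) - concordant
    return (concordant - discordant) / (concordant + discordant)

if __name__ == "__main__":
    # Hypothetical pairwise judgments over five (source, A, B) triples.
    human = [+1, +1, -1, +1, -1]
    metric = [+1, -1, -1, +1, -1]
    print(f"Kendall's Tau: {kendalls_tau(human, metric):.2f}")  # 0.60
```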
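
The Experiment Setup row lists the Table 6 hyperparameters and the every-1000-steps early-stopping rule. Since the paper mentions the Huggingface library (see Software Dependencies), here is a minimal sketch of how the MT-Ranker-Base settings might be expressed as `TrainingArguments`; the argument mapping, output directory, and the `kendall_tau` metric key are assumptions, not confirmed details of the authors' setup.

```python
# Minimal sketch (assumed setup, not the released code): mapping the Table 6
# hyperparameters for MT-Ranker-Base onto Huggingface TrainingArguments, with
# evaluation every 1000 steps and best-checkpoint selection on the dev-set
# Kendall's Tau. The metric key "kendall_tau" is an assumption.
from transformers import TrainingArguments

stage1_args = TrainingArguments(
    output_dir="mt-ranker-base-stage1",    # hypothetical path
    per_device_train_batch_size=128,       # Table 6: batch size 128
    learning_rate=5e-5,                    # Table 6: 5 x 10^-5
    max_steps=100_000,                     # Table 6: 100k stage-1 steps
    evaluation_strategy="steps",
    eval_steps=1_000,                      # evaluate every 1000 steps
    save_steps=1_000,
    load_best_model_at_end=True,           # keep the best dev-metric checkpoint
    metric_for_best_model="kendall_tau",
    greater_is_better=True,
)
```

Depending on the `transformers` version, the evaluation-strategy argument may be spelled `eval_strategy` rather than `evaluation_strategy`.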