Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Towards Explainable Evaluation Metrics for Machine Translation

Authors: Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT-4. Finally, we contribute a vision of next-generation approaches, including natural language explanations.
Researcher Affiliation | Collaboration | Christoph Leiter (Natural Language Learning Group, University of Mannheim, B6 26, 68159 Mannheim, Germany); Piyawat Lertvittayakumjorn (Imperial College London); Marina Fomicheva (University of Sheffield); Wei Zhao (University of Aberdeen; Heidelberg Institute for Theoretical Studies); Yang Gao (Royal Holloway, University of London); Steffen Eger (University of Mannheim)
Pseudocode | No | The paper describes various methods and approaches but does not contain any structured pseudocode or algorithm blocks. It is a survey paper summarizing existing techniques.
Open Source Code | No | As a survey and concept paper, it does not present new methodology that would typically be accompanied by open-source code. There is no statement about releasing code or a link to a code repository for the work described in this paper.
Open Datasets | No | The paper is a concept and survey paper and does not conduct its own experiments, so no datasets are "used in the experiments" by this paper. It discusses various datasets used by other research in the field but presents no experimental results of its own that would require a dataset.
Dataset Splits | No | The paper does not conduct its own experiments and therefore specifies no training/validation/test dataset splits.
Hardware Specification | No | The paper does not conduct its own experiments, so no hardware details for running experiments are mentioned.
Software Dependencies | No | The paper does not conduct its own experiments, so no software dependencies with version numbers are mentioned for replicating experiments.
Experiment Setup | No | The paper does not conduct its own experiments and therefore provides no experimental setup details such as hyperparameters or training configurations.