Quantifying the Plausibility of Context Reliance in Neural Machine Translation
Authors: Gabriele Sarti, Grzegorz Chrupała, Malvina Nissim, Arianna Bisazza
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this, we introduce Plausibility Evaluation of Context Reliance (PECORE), an end-to-end interpretability framework designed to quantify context usage in language models' generations. Our approach leverages model internals to (i) contrastively identify context-sensitive target tokens in generated texts and (ii) link them to contextual cues justifying their prediction. We use PECORE to quantify the plausibility of context-aware machine translation models, comparing model rationales with human annotations across several discourse-level phenomena. Finally, we apply our method to unannotated model translations to identify context-mediated predictions and highlight instances of (im)plausible context usage throughout generation. |
| Researcher Affiliation | Academia | Gabriele Sarti1 Grzegorz Chrupała2 Malvina Nissim1 Arianna Bisazza1 1Center for Language and Cognition (CLCG), University of Groningen 2Dept. of Cognitive Science and Artificial Intelligence (CSAI), Tilburg University {g.sarti, m.nissim, a.bisazza}@rug.nl, grzegorz@chrupala.me |
| Pseudocode | Yes | Algorithm 1: PECORE cue-target extraction process |
| Open Source Code | Yes | Code: https://github.com/gsarti/pecore. The CLI command inseq attribute-context available in the Inseq library is a generalized PECORE implementation: https://github.com/inseq-team/inseq |
| Open Datasets | Yes | Evaluation Datasets: To our knowledge, the only resource matching these requirements is SCAT (Yin et al., 2021), an English→French corpus... SCAT+: https://hf.co/datasets/inseq/scat. Additionally, we manually annotate contextual cues in DISCEVAL-MT (Bawden et al., 2018), another English→French corpus... DISCEVAL-MT: https://hf.co/datasets/inseq/disc_eval_mt. Our final evaluation set contains 250 SCAT+ and 400 DISCEVAL-MT translations... We fine-tune models... on 242k IWSLT 2017 English→French examples (Cettolo et al., 2017)... continue fine-tuning on the SCAT training split, containing 11k examples... |
| Dataset Splits | Yes | Our final evaluation set contains 250 SCAT+ and 400 DISCEVAL-MT translations across two discourse phenomena. We fine-tune models... on 242k IWSLT 2017 English→French examples (Cettolo et al., 2017)... continue fine-tuning on the SCAT training split, containing 11k examples with inter- and intra-sentential pronoun anaphora. |
| Hardware Specification | No | We thank the Center for Information Technology of the University of Groningen for providing access to the Hábrók high performance computing cluster used in fine-tuning and evaluation experiments. This statement indicates a general type of computing resource but does not specify any particular CPU, GPU models, memory, or other detailed hardware specifications. |
| Software Dependencies | No | We evaluate two bilingual Opus MT models (Tiedemann & Thottingal, 2020)... and mBART-50 1-to-many (Tang et al., 2021), a larger multilingual MT model supporting 50 target languages, using the Transformers library (Wolf et al., 2020). The paper mentions the 'Transformers library' but does not specify its version number, nor does it list versions for other crucial software like Python, PyTorch, etc. |
| Experiment Setup | Yes | We fine-tune models using extended translation units (Tiedemann & Scherrer, 2017)... using a dynamic context size of 0-4 preceding sentences... To further improve models' context sensitivity, we continue fine-tuning on the SCAT training split... Specifically, we apply PECORE to the context-aware Opus MT Large and mBART-50 models of Section 4.1, using KL-Divergence as CTI metric and KL as CCI attribution method. We set s<sub>CTI</sub> and s<sub>CCI</sub> to two standard deviations above the per-example average score to focus our analysis on very salient tokens. |
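The two-standard-deviation selection rule quoted in the Experiment Setup row can be sketched in plain Python. This is an illustrative reconstruction, not code from the PECORE repository: the function name, signature, and example scores are hypothetical.

```python
import statistics

def select_salient(scores, num_std=2.0):
    """Return indices of tokens whose score exceeds mean + num_std * stdev.

    Sketch of the per-example selection rule described in the paper:
    a token counts as salient only if its CTI/CCI score lies at least
    two standard deviations above the example's average score.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population stdev over this example's scores
    threshold = mean + num_std * std
    return [i for i, s in enumerate(scores) if s > threshold]

# Hypothetical per-token scores: one clear outlier passes the threshold.
scores = [0.01, 0.02, 0.015, 0.9, 0.02, 0.01]
print(select_salient(scores))  # → [3]
```

Selecting relative to each example's own mean and spread (rather than a global cutoff) keeps the analysis focused on tokens that stand out within their sentence, which matches the "per-example average score" phrasing in the quoted setup.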