Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Leveraging Online User Feedback to Improve Statistical Machine Translation
Authors: Lluís Formiga, Alberto Barrón-Cedeño, Lluís Màrquez, Carlos A. Henríquez, José B. Mariño
JAIR 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a thorough evaluation on a real-world dataset collected from the Reverso.net translation service and show that every step in our methodology contributes significantly to improve a general purpose SMT system. Interestingly, the quality improvement is not only due to the increase of lexical coverage, but to a better lexical selection, reordering, and morphology. Finally, we show the robustness of the methodology by applying it to a different scenario, in which the new examples come from an automatically Web-crawled parallel corpus. Using exactly the same architecture and models provides again a significant improvement of the translation quality of a general purpose baseline SMT system. |
| Researcher Affiliation | Collaboration | Lluís Formiga EMAIL Verbio Technologies, S.L., Loreto, 44, 08029 Barcelona Alberto Barrón-Cedeño EMAIL Lluís Màrquez EMAIL Qatar Computing Research Institute Hamad Bin Khalifa University, Tornado Tower, Floor 10, P.O. Box 5825, Doha, Qatar Carlos A. Henríquez EMAIL José B. Mariño EMAIL TALP Research Center Universitat Politècnica de Catalunya, Jordi Girona, 1-3, 08034 Barcelona |
| Pseudocode | Yes | Algorithm 1 SimTer. A pivot-based algorithm to align SRC and UE through TGT |
| Open Source Code | No | The paper references a third-party tool's repository: "Matecat (2015). Matecat official repository. https://github.com/matecat/MateCat. Accessed: 2015-07-24." However, there is no explicit statement or link indicating that the authors have made their *own* code for the described methodology publicly available. |
| Open Datasets | Yes | As training material we used the English Spanish Faust Feedback Filtering (FFF+) 2 corpus, developed within the FAUST EU project. It contains 550 examples of real translation requests and user-edits from the Reverso.net translation Web service. Available at ftp://mi.eng.cam.ac.uk/data/faust/UPC-Mar2013-FAUST-feedback-annotation.tgz. We selected different datasets for these experiments. In order to optimize the β parameters of the similarity function in Equation (1), we used the Europarl v6 corpus, EPPS (Koehn, 2005), to build a base phrase-based SMT system. In order to tune the α and λ parameters, and to validate the proposed methodology, we used the corpora from the WMT 12 campaign (Callison-Burch, Koehn, Monz, Post, Soricut, & Specia, 2012). In the second scenario, new material is selected (cf. Section 5.2) from Common Crawl (Smith et al., 2013). |
| Dataset Splits | Yes | We used SVMlight (Joachims, 1999) with linear, polynomial, and RBF kernels and we tuned the classifiers with 90% of the FFF+ corpus. The remaining 10% was left aside for testing purposes. Additionally, we used the WMT 08-11 test material for tuning the α and the TM s λs (dev), and WMT 12/13 tests for testing the methodology (test12 and test13). In our experiments we considered the FAUST dev Clean version for tuning (less error prone), and the real FAUST test Raw for testing. |
| Hardware Specification | Yes | These figures were computed on a Linux server with 96 GB of RAM and 24-core CPU Xeon processors 1.6 GHz (134064 Bogomips in total). |
| Software Dependencies | No | The paper mentions several software tools and algorithms, such as "SVMlight (Joachims, 1999)", "Moses training with EPPS (Koehn & Hoang, 2007)", and the "Freeling suite of NLP analyzers (Padró, Collado, Reese, Lloberes, & Castellón, 2010)". However, specific version numbers for these tools or any other critical software libraries used for the implementation are not provided. |
| Experiment Setup | Yes | We trained support vector machines (SVM) with the previously described features to learn the classifiers. We used SVMlight (Joachims, 1999) with linear, polynomial, and RBF kernels and we tuned the classifiers with 90% of the FFF+ corpus. The remaining 10% was left aside for testing purposes. Feature values were clipped to fit into the range µ ± 3σ² to decrease the impact of outliers. Normalization was then applied by means of z-score: x = (x − µ)/σ. Our training strategy aimed at optimizing F1 and consisted of two iterative steps: (a) parameter tuning: a grid search for the most appropriate SVM parameters (Hsu, Chang, & Lin, 2003), and (b) feature selection: a wrapper strategy, implementing backward elimination to discard redundant or irrelevant features (Witten & Frank, 2005, p. 294). We built the baseline SMT system following the standard pipeline of a Moses phrase-based system (Koehn & Hoang, 2007) from words into words and POS tags (Formiga et al., 2012). When combining the translation models, the BLEU improved from 27.86 to 28.75, achieving its highest value with α = 0.6 (i.e., a 60/40% distribution of the weight for the base and edited translation models, respectively). |
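The 90/10 split of the 550-example FFF+ corpus described in the Dataset Splits row can be sketched as follows. The helper name `split_corpus`, the shuffle, and the seed are illustrative assumptions; the authors' actual partition procedure is not published:

```python
import random


def split_corpus(examples, train_frac=0.9, seed=0):
    """Shuffle a corpus and split it into train/test portions.

    Mirrors the 90%/10% FFF+ split reported in the paper; the seeded
    shuffle is an assumption for reproducibility of this sketch only.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    cut = int(len(examples) * train_frac)
    train = [examples[i] for i in indices[:cut]]
    test = [examples[i] for i in indices[cut:]]
    return train, test
```

With 550 examples this yields 495 training and 55 test items, matching the proportions the paper reports for classifier tuning and held-out evaluation.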
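The feature preprocessing quoted in the Experiment Setup row (clip outliers to a range around the mean, then z-score normalize) can be sketched as below. The clipping bound is parameterized because the exact range in the paper is ambiguous after PDF extraction; treat `n_sigmas=3` as an assumption:

```python
import numpy as np


def clip_and_zscore(values, n_sigmas=3.0):
    """Clip feature values around the mean, then z-score normalize.

    Sketch of the paper's preprocessing: outliers are clipped to
    [mu - n_sigmas*sigma, mu + n_sigmas*sigma] (bound assumed), then
    the quoted z-score x -> (x - mu) / sigma is applied.
    """
    x = np.asarray(values, dtype=float)
    mu, sigma = x.mean(), x.std()
    clipped = np.clip(x, mu - n_sigmas * sigma, mu + n_sigmas * sigma)
    return (clipped - mu) / sigma
```

When no value falls outside the clipping range, this reduces to plain z-score normalization, producing zero mean and unit standard deviation.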
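The α = 0.6 weighting between the base and feedback-derived translation models could, under a simple linear-interpolation reading, look like the sketch below. This is a hypothetical illustration: Moses supports several model-combination modes (e.g. multiple decoding paths), and the paper's exact mechanism is not reproduced here. The function name and dict-based phrase tables are assumptions:

```python
def combine_phrase_tables(base, edited, alpha=0.6):
    """Linearly interpolate two phrase tables' translation probabilities.

    Illustrative only: p(e|f) = alpha * p_base + (1 - alpha) * p_edited,
    with alpha = 0.6 matching the paper's best-scoring 60/40 weighting
    of base vs. user-edit-derived models.
    """
    phrases = set(base) | set(edited)
    return {
        ph: alpha * base.get(ph, 0.0) + (1 - alpha) * edited.get(ph, 0.0)
        for ph in phrases
    }
```

Phrases absent from one table receive probability 0.0 from that side, so coverage gains from the edited model survive the interpolation with weight 1 − α.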