Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Tradutor: Building a Variety Specific Translation Model
Authors: Hugo Sousa, Satya Almasian, Ricardo Campos, Alípio Jorge
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties. |
| Researcher Affiliation | Academia | 1Faculty of Sciences, University of Porto, Porto, Portugal 2INESC TEC, Porto, Portugal 3Institute of Computer Science, Heidelberg University, Germany 4Department of Informatics, University of Beira Interior, Covilhã, Portugal 5Ci2 Smart Cities Research Center, Tomar, Portugal EMAIL, EMAIL |
| Pseudocode | No | The paper describes methodologies in paragraph form, but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties. (...) The code to replicate the dataset is available as open-source [7], and the final corpus is publicly accessible on Hugging Face [8]. (...) [7] https://github.com/hmosousa/ptradutor (...) The code for training and evaluation, as well as the trained checkpoints of our models, is available in our repository [9]. (...) [9] https://github.com/hmosousa/tradutor |
| Open Datasets | Yes | As part of this work, we have created and publicly released a meticulously curated parallel corpus for European Portuguese, comprising 1,719,002 documents, the largest of its kind to date. (...) The final corpus is publicly accessible on Hugging Face [8]. (...) [8] https://huggingface.co/datasets/hugosousa/PTradutor (...) For the collection of texts in European Portuguese, we used two sources: the DSL-TL (Zampieri et al. 2024) and the PtBrVid corpus (Sousa et al. 2025). (...) FRMT: This dataset is specifically designed to contain regional variants of Portuguese and Chinese (Riley et al. 2023) (...) NTrex: The dataset consists of high-quality translations by speakers who are bilingual in English and in one of the 128 target languages (...) (Federmann, Kocmi, and Xin 2022). |
| Dataset Splits | Yes | For our purposes, we use the train partition of this dataset [DSL-TL] and keep the texts labeled as European Portuguese and Both, as they are both valid texts in European Portuguese, resulting in a total of 1,734 documents. (...) We evaluate our system against various baselines on two European Portuguese benchmarks. (...) Test Benchmarks: As a low-resource language variant, the number of benchmarks that include European Portuguese is limited. In this study, we use two high-quality publicly available datasets that feature this variant: FRMT (...) NTrex (...). Training runs were executed with early stopping, using the test set of the DSL-TL corpus as the validation set, with a patience of 3,000 steps. |
| Hardware Specification | Yes | All models were trained and evaluated on a server with six A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | Monolingual data was translated into English using Google Translate with the Python library deep-translator [4]. (...) jusText [5], a library designed to clean boilerplate content (...) Token count was determined using the LLaMA-3 tokenizer. (...) training and evaluation scripts compatible with the two libraries used to train the language models: torchtune [11] and transformers [12]. |
| Experiment Setup | Yes | Phi-3: For both the LoRA and full fine-tuning of Phi-3 models, we used a batch size of 512, a learning rate of 2e-5, a weight decay of 0.1, and a warm-up of 1,000 steps. Gemma-2: For both variants, the learning rate was set to 2e-5 with a weight decay of 0.1. The full fine-tuned model was trained with a 1,000-step warm-up and a batch size of 512, while the LoRA variant had 500 warm-up steps and a batch size of 256. LLaMA-3: Both variants were trained with a batch size of 256 and a learning rate of 2e-5. The LoRA variant additionally includes a warm-up of 100 steps and a weight decay of 0.1 on the learning rate. All LoRA variants were trained with an alpha of 128 and a rank of 256. |
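The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration table for quick comparison. The sketch below is our own transcription (the dictionary structure and `describe` helper are not from the paper; `None` marks values the quote does not state for a given variant):

```python
# Hyperparameters transcribed from the paper's "Experiment Setup" quote.
# Keys and structure are ours; values come from the quoted text.
CONFIGS = {
    ("Phi-3", "full"):   {"batch_size": 512, "lr": 2e-5, "weight_decay": 0.1, "warmup_steps": 1000},
    ("Phi-3", "lora"):   {"batch_size": 512, "lr": 2e-5, "weight_decay": 0.1, "warmup_steps": 1000},
    ("Gemma-2", "full"): {"batch_size": 512, "lr": 2e-5, "weight_decay": 0.1, "warmup_steps": 1000},
    ("Gemma-2", "lora"): {"batch_size": 256, "lr": 2e-5, "weight_decay": 0.1, "warmup_steps": 500},
    ("LLaMA-3", "full"): {"batch_size": 256, "lr": 2e-5, "weight_decay": None, "warmup_steps": None},
    ("LLaMA-3", "lora"): {"batch_size": 256, "lr": 2e-5, "weight_decay": 0.1, "warmup_steps": 100},
}

# Shared by all LoRA variants, per the quote.
LORA = {"alpha": 128, "rank": 256}

def describe(model: str, variant: str) -> str:
    """Return a one-line summary of a reported training configuration,
    skipping fields the paper does not report for that variant."""
    cfg = CONFIGS[(model, variant)]
    parts = [f"{k}={v}" for k, v in cfg.items() if v is not None]
    if variant == "lora":
        parts += [f"lora_alpha={LORA['alpha']}", f"lora_rank={LORA['rank']}"]
    return f"{model} ({variant}): " + ", ".join(parts)
```

For example, `describe("Gemma-2", "lora")` summarizes the LoRA run of Gemma-2 with its reduced batch size (256) and shorter warm-up (500 steps) relative to the full fine-tune.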