Unsupervised Neural Dialect Translation with Commonality and Diversity Modeling
Authors: Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Haihua Du, Ben C.H. Ao
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In order to examine the effectiveness of the proposed models, we collect a 20 million monolingual corpus for each of Mandarin and Cantonese, which are the official language and the most widely used dialect in China. Experimental results reveal that our methods outperform rule-based simplified and traditional Chinese conversion and conventional unsupervised translation models by over 12 BLEU points. |
| Researcher Affiliation | Academia | Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Haihua Du, Ben C.H. Ao NLP2CT Lab, Department of Computer and Information Science, University of Macau {nlp2ct.ywan, nlp2ct.baosong, nlp2ct.duhaihua, nlp2ct.benao}@gmail.com, {derekfw,lidiasc}@um.edu.mo |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes and data are released at: https://github.com/NLP2CT/Unsupervised_Dialect_Translation. |
| Open Datasets | Yes | The lack of CAN monolingual corpora with strong colloquial features is a serious obstacle in our research. Existing CAN corpora, such as HKCanCor (Luke and Wong 2015) and CANCORP (Lee and Wong 1998), all have the following shortcomings: 1) they were collected in rather early years, and their linguistic features vary from the current ones due to language evolution; and 2) they are too scarce for data-intensive unsupervised training. Because colloquial corpora possess more distinctive linguistic features of CAN, we collect CAN sentences from scratch across domains including talks, comments and dialogues. To maintain the consistency of the training sets, MAN corpora are derived from the same domains as CAN, drawn from the Chinese NLP Corpus and the Large Scale Chinese Corpus for NLP. |
| Dataset Splits | Yes | Parallel sentence pairs from dialogues are manually selected by native CAN and MAN speakers. Consequently, 1,227 and 1,085 sentence pairs are selected as the development and test sets, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using TRANSFORMER and fasttext, but does not provide specific version numbers for these or other software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | We refer to the parameter setting of Lample et al. (2018c), and implement our approach on top of their source code. We use BLEU score as the evaluation metric. The training of each model was early-stopped to maximize BLEU score on the development set. All the embeddings are pretrained using fasttext (Bojanowski et al. 2017), and pivot embeddings are derived from the concatenated training corpora. During training, λdiv is set to 1.0, while λcom is linearly decayed from 1.0 at the start of training to 0.0 at step 200k (a minimal sketch of this decay schedule follows the table). |
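
The following is a minimal Python sketch, not the authors' released code, of the loss-weight schedule the setup describes: λdiv is held constant at 1.0 while λcom is linearly decayed from 1.0 to 0.0 over the first 200k training steps. Function and constant names are illustrative assumptions.

```python
# Hedged sketch of the loss-weight schedule described in the experiment setup:
# lambda_div stays fixed at 1.0, while lambda_com decays linearly from 1.0 at
# step 0 to 0.0 at step 200k and stays at 0.0 afterwards.

TOTAL_DECAY_STEPS = 200_000  # step at which lambda_com reaches 0.0 (from the paper)


def loss_weights(step: int) -> tuple[float, float]:
    """Return (lambda_com, lambda_div) for a given training step."""
    lambda_div = 1.0
    lambda_com = max(0.0, 1.0 - step / TOTAL_DECAY_STEPS)
    return lambda_com, lambda_div


if __name__ == "__main__":
    # Print the schedule at a few representative steps.
    for step in (0, 50_000, 100_000, 200_000, 300_000):
        lam_com, lam_div = loss_weights(step)
        print(f"step {step:>7}: lambda_com={lam_com:.2f}, lambda_div={lam_div:.2f}")
```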