Unsupervised Neural Dialect Translation with Commonality and Diversity Modeling
Authors: Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Haihua Du, Ben C.H. Ao
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In order to examine the effectiveness of the proposed models, we collect a 20 million monolingual corpus for each of Mandarin and Cantonese, which are the official language and the most widely used dialect in China. Experimental results reveal that our methods outperform rule-based simplified and traditional Chinese conversion and conventional unsupervised translation models by over 12 BLEU points. |
| Researcher Affiliation | Academia | Yu Wan, Baosong Yang, Derek F. Wong, Lidia S. Chao, Haihua Du, Ben C.H. Ao NLP2CT Lab, Department of Computer and Information Science, University of Macau {nlp2ct.ywan, nlp2ct.baosong, nlp2ct.duhaihua, nlp2ct.benao}@gmail.com, {derekfw,lidiasc}@um.edu.mo |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes and data are released at: https://github.com/NLP2CT/Unsupervised_Dialect_Translation. |
| Open Datasets | Yes | The lack of CAN monolingual corpora with strong colloquial features is a serious obstacle in our research. Existing CAN corpora, such as HKCanCor (Luke and Wong 2015) and CANCORP (Lee and Wong 1998), all have the following shortcomings: 1) they were collected in rather early years, and their linguistic features vary from the current ones due to language evolution; and 2) they are too scarce for data-intensive unsupervised training. Because colloquial corpora possess more distinctive linguistic features of CAN, we collect CAN sentences from scratch across domains including talks, comments and dialogues. To maintain the consistency of the training sets, MAN corpora are derived from the same domains as CAN, drawn from the Chinese NLP Corpus and the Large Scale Chinese Corpus for NLP. |
| Dataset Splits | Yes | Parallel sentence pairs from dialogues are manually selected by native CAN and MAN speakers. Consequently, 1,227 and 1,085 sentence pairs are selected as the development and test sets, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using TRANSFORMER and fasttext, but does not provide specific version numbers for these or other software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | We refer to the parameter setting of Lample et al. (2018c), and implement our approach on top of their source code. We use BLEU score as the evaluation metric. The training of each model was early-stopped to maximize BLEU score on the development set. All the embeddings are pretrained using fasttext (Bojanowski et al. 2017), and pivot embeddings are derived from the concatenated training corpora. During training, λdiv is set to 1.0, while λcom is linearly decayed from 1.0 at the start of training to 0.0 at step 200k (a minimal sketch of this decay schedule follows the table). |
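
The following is a minimal Python sketch, not the authors' released code, of the loss-weight schedule the setup describes: λdiv is held constant at 1.0 while λcom is linearly decayed from 1.0 to 0.0 over the first 200k training steps. Function and constant names are illustrative assumptions.

```python
# Hedged sketch of the loss-weight schedule described in the experiment setup:
# lambda_div stays fixed at 1.0, while lambda_com decays linearly from 1.0 at
# step 0 to 0.0 at step 200k and stays at 0.0 afterwards.

TOTAL_DECAY_STEPS = 200_000  # step at which lambda_com reaches 0.0 (from the paper)


def loss_weights(step: int) -> tuple[float, float]:
    """Return (lambda_com, lambda_div) for a given training step."""
    lambda_div = 1.0
    lambda_com = max(0.0, 1.0 - step / TOTAL_DECAY_STEPS)
    return lambda_com, lambda_div


if __name__ == "__main__":
    # Print the schedule at a few representative steps.
    for step in (0, 50_000, 100_000, 200_000, 300_000):
        lam_com, lam_div = loss_weights(step)
        print(f"step {step:>7}: lambda_com={lam_com:.2f}, lambda_div={lam_div:.2f}")
```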