Improving Simultaneous Machine Translation with Monolingual Data
Authors: Hexuan Deng, Liang Ding, Xuebo Liu, Meishan Zhang, Dacheng Tao, Min Zhang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Preliminary experiments on En-Zh and En-Ja news-domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional NMT monolingual sampling strategies) by avoiding the key problem of SiMT, hallucination, and has better scalability. ... We conduct experiments on two widely used SiMT language directions: English-Chinese (En-Zh) and English-Japanese (En-Ja). |
| Researcher Affiliation | Collaboration | Hexuan Deng1*, Liang Ding2, Xuebo Liu1, Meishan Zhang1, Dacheng Tao2, Min Zhang1 1 Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China 2 JD Explore Academy, JD.com Inc. 22s051030@stu.hit.edu.cn, dingliang1@jd.com, liuxuebo@hit.edu.cn, zhangmeishan@hit.edu.cn, dacheng.tao@gmail.com, zhangmin2021@hit.edu.cn |
| Pseudocode | No | The paper describes its methods and strategies in prose but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT. |
| Open Datasets | Yes | For En-Zh, we use CWMT Corpus (Chen and Zhang 2019) as training data... For En-Ja, we use JParaCrawl (Morishita, Suzuki, and Nagata 2020) and WikiMatrix (Schwenk et al. 2021) as training data... We publicly release our processed datasets. |
| Dataset Splits | Yes | For En-Zh, we use CWMT Corpus (Chen and Zhang 2019) as training data, NJU-newsdev2018 as the validation set and report results on CWMT2008, CWMT2009, and CWMT2011; For En-Ja, we use JParaCrawl (Morishita, Suzuki, and Nagata 2020) and WikiMatrix (Schwenk et al. 2021) as training data, newsdev2020 as the validation set and report results on newstest2020. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types, memory) for running its experiments. It mentions 'We train all models with identical training steps' but provides no hardware details. |
| Software Dependencies | No | The paper mentions several software tools used (e.g., SentencePiece, SacreBLEU, SimulEval, fast-align, KenLM) along with their corresponding citations, but it does not provide specific version numbers for these software dependencies, which are required for reproducibility (see the version-recording sketch after the table). |
| Experiment Setup | No | The paper describes general aspects of the model architecture (BASE Transformer, causal encoders, wait-k policy) and training process ('identical training steps'), but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings (see the wait-k sketch after the table). |
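Since the paper names its toolchain but not the versions (Software Dependencies row above), the sketch below shows one way a reproduction could record the missing version information. This is a hypothetical helper, not code from the paper; it assumes the tools are installed under their standard pip package names, and fast-align and KenLM are C++ tools whose versions would instead have to be recorded from their build commits.

```python
# Hypothetical version-recording helper (not from the paper): pins the
# Python packages the paper mentions so a reproduction can match them.
from importlib.metadata import version, PackageNotFoundError

# Assumed standard pip distribution names for the tools named in the paper.
PACKAGES = ["sentencepiece", "sacrebleu", "simuleval"]

for name in PACKAGES:
    try:
        # Prints requirements.txt-style pins, e.g. "sacrebleu==2.4.0".
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```

Committing the printed pins alongside the released code would close the dependency-version gap flagged above.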
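The Experiment Setup row refers to the wait-k policy (Ma et al. 2019), under which the decoder first reads k source tokens and then alternates between writing one target token and reading one more source token. Below is a minimal sketch of that schedule, assuming a hypothetical `translate_step(source_prefix, target_prefix)` model call; the paper does not provide its decoding code, so this is an illustration of the policy, not the authors' implementation.

```python
# Minimal wait-k decoding schedule. `translate_step` is a hypothetical
# stand-in for the SiMT model: it maps a source prefix and the target
# tokens written so far to the next target token.
def wait_k_decode(source_tokens, k, translate_step, max_len=200):
    target = []
    num_read = min(k, len(source_tokens))  # initial READ phase: k tokens
    while len(target) < max_len:
        next_token = translate_step(source_tokens[:num_read], target)  # WRITE
        if next_token == "</s>":
            break
        target.append(next_token)
        if num_read < len(source_tokens):
            num_read += 1  # READ one more source token after each WRITE
    return target
```

With small k the model must commit to target tokens from a short source prefix, which is the regime the paper associates with SiMT hallucination and which its monolingual sampling strategy is designed to avoid.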