A Dataset for Low-Resource Stylized Sequence-to-Sequence Generation

Authors: Yu Wu, Yunli Wang, Shujie Liu (pp. 9290-9297)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We construct two large-scale, multiple-reference datasets for low-resource stylized S2S: the Machine Translation Formality Corpus (MTFC), which is easy to evaluate, and the Twitter Conversation Formality Corpus (TCFC), which tackles an important problem in chatbots. These datasets contain context-to-source-style parallel data, source-style-to-target-style parallel data, and non-parallel sentences in the target style to enable semi-supervised learning. We provide three baselines: the pivot-based method, the teacher-student method, and the back-translation method (a sketch of the pivot-based pipeline is given after this table). We find that the pivot-based method performs worst, while the other two methods each achieve the best score on different metrics.
Researcher Affiliation | Collaboration | Yu Wu (1), Yunli Wang (2), Shujie Liu (1); (1) Microsoft Research, Beijing, China; (2) State Key Lab of Software Development Environment, Beihang University, Beijing, China. {yuwu1, shujliu}@microsoft.com, {wangyunli}@buaa.edu.cn
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Data and code are shared at https://github.com/MarkWuNLP/Data4StylizedS2S
Open Datasets | Yes | extending Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault 2018)... The MTFC consists of 15 million informal Chinese to informal English text pairs, which are carefully filtered from the OpenSubtitles dataset (Lison and Tiedemann 2016).
Dataset Splits | Yes | We ask two native speakers to rewrite 2000 responses into formal responses (1000 for tuning and 1000 for testing)... The GYAFC provides 2877 and 1416 informal-formal English sentence pairs in the Entertainment & Music domain for tuning and testing... Table 3: Corpus statistics. In the column for dataset D, the three numbers are the number of sentence pairs, the average word count of x, and the average word count of ys. Similarly, the three numbers for dataset S are the number of sentence pairs, the average word count of a source-style sentence ys, and the average word count of a target-style sentence yt. (Table 3 lists validation counts of 2865 for MTFC and 980 for TCFC.)
Hardware Specification | Yes | All models are trained on 4 Tesla Titan X GPUs for a total of 200K steps using the Adam algorithm (Kingma and Ba 2014) with β1 = 0.9 and β2 = 0.98.
Software Dependencies | No | The paper mentions software components and algorithms such as Transformer, Adam, BPE, and GRU, but does not provide specific version numbers for any libraries or frameworks used (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | In the pivot-based model, the Transformer model (Vaswani et al. 2017) is adopted to approximate the conditional sequence generation probability P(ys|x, θx→ys). The Transformer consists of a 6-layer encoder and a 6-layer decoder with a model size of 512 and 8 attention heads. All models are trained on 4 Tesla Titan X GPUs for a total of 200K steps using the Adam algorithm (Kingma and Ba 2014) with β1 = 0.9 and β2 = 0.98. The byte-pair encoding (BPE) approach (Sennrich, Haddow, and Birch 2016b) is used to handle the open-vocabulary problem, with a vocabulary size of 25,000. The initial learning rate is set to 0.2 and decayed according to the schedule in (Vaswani et al. 2017). During training, the batch size is 4096 words and checkpoints are created every 5000 batches. For all models, the beam size is 4 and the length penalty is 1.2. (A configuration sketch follows this table.)
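
To make the pivot-based baseline from the Research Type row concrete, here is a minimal sketch of its two-step generation: one seq2seq model maps the context x to a source-style (informal) response ys, and a second model rewrites ys into the target (formal) style yt. The model names and the `translate` interface are hypothetical placeholders, not the authors' released code; only the beam size of 4 is taken from the paper.

```python
# Minimal sketch of the pivot-based baseline (hypothetical interface, not the
# authors' code). Two independently trained seq2seq models are assumed, each
# exposing a `translate(text, beam_size)` method.

def pivot_generate(x, context_to_source_model, source_to_target_model, beam_size=4):
    """Two-step pivot generation: approximate P(ys|x), then P(yt|ys)."""
    # Step 1: generate a source-style (informal) hypothesis from the context.
    ys = context_to_source_model.translate(x, beam_size=beam_size)
    # Step 2: rewrite the hypothesis into the target (formal) style.
    yt = source_to_target_model.translate(ys, beam_size=beam_size)
    return yt
```

One plausible reason this baseline scores worst, consistent with the reported finding, is that the second model only ever sees ys, so any error made in the first step propagates unchecked into the final output.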
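
The reported experiment setup can also be summarized as a plain configuration dictionary together with the inverse-square-root learning-rate schedule of Vaswani et al. (2017) that the paper says it follows. This is a minimal sketch under stated assumptions: the warmup step count is not given in the excerpt and is assumed here, and no particular framework is implied.

```python
# Hyperparameters quoted from the paper, collected as a plain config dict.
config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "model_size": 512,            # hidden size of the Transformer
    "attention_heads": 8,
    "bpe_vocab_size": 25_000,
    "optimizer": "adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "initial_learning_rate": 0.2,
    "batch_size_words": 4096,
    "total_steps": 200_000,
    "checkpoint_every_batches": 5000,
    "beam_size": 4,
    "length_penalty": 1.2,
}

def transformer_lr(step, model_size=512, warmup_steps=4000, scale=0.2):
    """Inverse-square-root schedule from Vaswani et al. (2017), scaled by the
    paper's initial learning rate of 0.2; warmup_steps=4000 is an assumption."""
    step = max(step, 1)
    return scale * (model_size ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
```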