A Dataset for Low-Resource Stylized Sequence-to-Sequence Generation

Authors: Yu Wu, Yunli Wang, Shujie Liu (pp. 9290-9297)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We construct two large-scale, multiple-reference datasets for low-resource stylized S2S: the Machine Translation Formality Corpus (MTFC), which is easy to evaluate, and the Twitter Conversation Formality Corpus (TCFC), which tackles an important problem in chatbots. These datasets contain context-to-source-style parallel data, source-style-to-target-style parallel data, and non-parallel sentences in the target style to enable semi-supervised learning. We provide three baselines: the pivot-based method, the teacher-student method, and the back-translation method (a sketch of the pivot-based pipeline is given after this table). We find that the pivot-based method performs worst, while the other two methods each achieve the best score on different metrics.
Researcher Affiliation | Collaboration | Yu Wu (1), Yunli Wang (2), Shujie Liu (1); (1) Microsoft Research, Beijing, China; (2) State Key Lab of Software Development Environment, Beihang University, Beijing, China. {yuwu1, shujliu}@microsoft.com, {wangyunli}@buaa.edu.cn
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Data and code are shared at https://github.com/MarkWuNLP/Data4StylizedS2S
Open Datasets | Yes | extending Grammarly's Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault 2018)... The MTFC consists of 15 million informal Chinese to informal English text pairs, which are carefully filtered from the OpenSubtitles dataset (Lison and Tiedemann 2016).
Dataset Splits | Yes | We ask two native speakers to rewrite 2000 responses into formal responses (1000 for tuning and 1000 for testing)... The GYAFC provides 2877 and 1416 informal-formal English sentence pairs in the Entertainment & Music domain for tuning and testing... Table 3: Corpus statistics. In the column for dataset D, the three numbers are the number of sentence pairs, the average word count of x, and the average word count of ys. Similarly, the three numbers for dataset S are the number of sentence pairs, the average word count of a source-style sentence ys, and the average word count of a target-style sentence yt. (Table 3 lists validation counts of 2865 for MTFC and 980 for TCFC.)
Hardware Specification | Yes | All models are trained on 4 Tesla Titan X GPUs for a total of 200K steps using the Adam algorithm (Kingma and Ba 2014) with β1 = 0.9 and β2 = 0.98.
Software Dependencies | No | The paper mentions software components and algorithms such as Transformer, Adam, BPE, and GRU, but does not provide specific version numbers for any libraries or frameworks used (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | In the pivot-based model, the Transformer model (Vaswani et al. 2017) is adopted to approximate the conditional sequence generation probability P(ys|x, θx→ys). The Transformer consists of a 6-layer encoder and a 6-layer decoder with a model size of 512 and 8 attention heads. All models are trained on 4 Tesla Titan X GPUs for a total of 200K steps using the Adam algorithm (Kingma and Ba 2014) with β1 = 0.9 and β2 = 0.98. The byte-pair encoding (BPE) approach (Sennrich, Haddow, and Birch 2016b) is used to handle the open-vocabulary problem, with a vocabulary size of 25,000. The initial learning rate is set to 0.2 and decayed according to the schedule in (Vaswani et al. 2017). During training, the batch size is 4096 words and checkpoints are created every 5000 batches. For all models, the beam size is 4 and the length penalty is 1.2. (A configuration sketch follows this table.)
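
To make the pivot-based baseline from the Research Type row concrete, here is a minimal sketch of its two-step generation: one seq2seq model maps the context x to a source-style (informal) response ys, and a second model rewrites ys into the target (formal) style yt. The model names and the `translate` interface are hypothetical placeholders, not the authors' released code; only the beam size of 4 is taken from the paper.

```python
# Minimal sketch of the pivot-based baseline (hypothetical interface, not the
# authors' code). Two independently trained seq2seq models are assumed, each
# exposing a `translate(text, beam_size)` method.

def pivot_generate(x, context_to_source_model, source_to_target_model, beam_size=4):
    """Two-step pivot generation: approximate P(ys|x), then P(yt|ys)."""
    # Step 1: generate a source-style (informal) hypothesis from the context.
    ys = context_to_source_model.translate(x, beam_size=beam_size)
    # Step 2: rewrite the hypothesis into the target (formal) style.
    yt = source_to_target_model.translate(ys, beam_size=beam_size)
    return yt
```

One plausible reason this baseline scores worst, consistent with the reported finding, is that the second model only ever sees ys, so any error made in the first step propagates unchecked into the final output.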
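
The reported experiment setup can also be summarized as a plain configuration dictionary together with the inverse-square-root learning-rate schedule of Vaswani et al. (2017) that the paper says it follows. This is a minimal sketch under stated assumptions: the warmup step count is not given in the excerpt and is assumed here, and no particular framework is implied.

```python
# Hyperparameters quoted from the paper, collected as a plain config dict.
config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "model_size": 512,            # hidden size of the Transformer
    "attention_heads": 8,
    "bpe_vocab_size": 25_000,
    "optimizer": "adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "initial_learning_rate": 0.2,
    "batch_size_words": 4096,
    "total_steps": 200_000,
    "checkpoint_every_batches": 5000,
    "beam_size": 4,
    "length_penalty": 1.2,
}

def transformer_lr(step, model_size=512, warmup_steps=4000, scale=0.2):
    """Inverse-square-root schedule from Vaswani et al. (2017), scaled by the
    paper's initial learning rate of 0.2; warmup_steps=4000 is an assumption."""
    step = max(step, 1)
    return scale * (model_size ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)
```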