Self-Supervised Knowledge Assimilation for Expert-Layman Text Style Transfer

Authors: Wenda Xu, Michael Saxon, Misha Sra, William Yang Wang

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that transformer-based models pretrained on knowledge base assimilation and other well-established pretraining tasks, then fine-tuned on our new parallel corpus, lead to considerable improvement on expert-layman transfer benchmarks, gaining an average relative improvement of 106% on our human evaluation metric, the Overall Success Rate (OSR).
Researcher Affiliation | Academia | University of California, Santa Barbara, Department of Computer Science, Santa Barbara, California, USA. {wendaxu,saxon}@ucsb.edu, {sra,william}@cs.ucsb.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/xu1998hz/SSL_KBA_Expert_Layman_Style_Transfer.
Open Datasets | Yes | We evaluate our proposed method and current SOTA models using the MSD dataset (Cao et al. 2020).
Dataset Splits | No | MSD contains 245k medical training sentences, each labeled with either the expert or layman style, and a test set of 675 expert-layman sentence pairs of equivalent meaning. The training set is extended with 11,512 sentence pairs produced using a margin-based criterion (Schwenk 2018); a sketch of one such criterion follows the table. The paper does not specify a validation split with percentages or counts, although early stopping is mentioned.
Hardware Specification | Yes | Parallel corpus generation took 7.5 hours on a single Titan 1080 Ti GPU. For different SSL task combinations, pretraining took 6 hours on average and fine-tuning took 1.5 hours on a single Titan 1080 Ti GPU.
Software Dependencies | No | We use the standard training settings for all models with the Adam optimizer (Kingma and Ba 2015)... We train a style classifier on the MSD training set using fastText (Joulin et al. 2016)... We also use NLTK (Bird, Klein, and Loper 2009) to calculate 4-gram BLEU... We use KenLM (Heafield 2011) to train a 5-gram language model... We use ClinicalBERT's (Huang, Altosaar, and Ranganath 2020) tokenization for all models. The paper names these software components but does not provide version numbers; an illustrative usage sketch follows the table.
Experiment Setup | Yes | Max sequence length, learning rate and dropout rate are set to 100, 1e-4 and 0.5 respectively. Our model architecture follows Dai et al. (2019), with 4 layers, 4 attention heads per layer, and hidden size 256. We add one style token with 256 hidden units into the input sequence after the embedding layer. Finally, we augment our expected best condition of KBA + SSL pretraining with KBA + SSL Large, identical to the other transformer models except for a hidden size of 512. An illustrative configuration sketch follows the table.
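
The margin-based criterion (Schwenk 2018) mentioned under Dataset Splits scores candidate sentence pairs by cosine similarity penalized by each sentence's average similarity to its nearest neighbors. Below is a minimal sketch of one common formulation of that idea, assuming fixed sentence embeddings have already been computed; the function names, the choice of k, and the thresholding step are illustrative, not the paper's exact procedure.

```python
# Minimal sketch of a margin-based pair-mining criterion (in the spirit of
# Schwenk 2018). Assumes precomputed sentence embeddings for the expert and
# layman sides; k and all names here are illustrative assumptions.
import numpy as np

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def margin_scores(expert_emb: np.ndarray, layman_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Score each (expert, layman) pair by cosine similarity minus the mean
    similarity of each sentence to its k nearest neighbors on the other side."""
    sim = cosine_matrix(expert_emb, layman_emb)             # (n_expert, n_layman)
    nn_expert = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # avg sim of each expert sentence to its k-NN
    nn_layman = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # avg sim of each layman sentence to its k-NN
    return sim - (nn_expert[:, None] + nn_layman[None, :]) / 2.0

# Example: keep, for each expert sentence, its best layman match above a threshold.
scores = margin_scores(np.random.rand(50, 256), np.random.rand(60, 256))
best_match = scores.argmax(axis=1)
keep = scores.max(axis=1) > 0.05   # threshold chosen for illustration only
```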
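
For the components named under Software Dependencies, the sketch below shows how fastText, NLTK's BLEU, and KenLM are typically invoked in this kind of evaluation pipeline. The file names, training options, and example sentences are placeholders; nothing here pins the versions the paper actually used, which is the row's point.

```python
# Illustrative use of the evaluation components named in the report.
# File names ("msd_style_train.txt", "medical_5gram.arpa") are placeholders.
import fasttext                       # style classifier (Joulin et al. 2016)
import kenlm                          # 5-gram language model (Heafield 2011)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # NLTK BLEU

# Style classifier trained on lines formatted as "__label__expert <sentence>".
style_clf = fasttext.train_supervised(input="msd_style_train.txt", epoch=25)
labels, probs = style_clf.predict("the patient presented with acute dyspnea")

# 4-gram BLEU between a system output and a reference, with smoothing.
hyp = "the patient had trouble breathing".split()
ref = "the patient was short of breath".split()
bleu4 = sentence_bleu([ref], hyp,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

# Perplexity under a 5-gram KenLM model (ARPA file built separately with
# KenLM's lmplz tool).
lm = kenlm.Model("medical_5gram.arpa")
ppl = lm.perplexity("the patient had trouble breathing")
```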
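
The hyperparameters reported under Experiment Setup can be collected into a single configuration, sketched below as a same-sized Transformer encoder in PyTorch. This is a plausible reading of the reported numbers (a style-token model following Dai et al. 2019), not the authors' released implementation; the feed-forward width, vocabulary size, and the exact way the style token is injected are assumptions.

```python
# Hyperparameters as reported in the paper; the config container, the
# feed-forward width (assumed 4 * hidden), and the vocabulary size are
# illustrative choices.
import torch
import torch.nn as nn

config = {
    "num_layers": 4,        # transformer layers
    "num_heads": 4,         # attention heads per layer
    "hidden_size": 256,     # 512 for the "KBA + SSL Large" variant
    "dropout": 0.5,
    "max_seq_len": 100,
    "learning_rate": 1e-4,  # Adam (Kingma and Ba 2015)
}

encoder_layer = nn.TransformerEncoderLayer(
    d_model=config["hidden_size"],
    nhead=config["num_heads"],
    dim_feedforward=4 * config["hidden_size"],
    dropout=config["dropout"],
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=config["num_layers"])

# One learned style embedding (expert vs. layman) prepended to the token
# embeddings, matching the "one style token ... after the embedding layer" detail.
style_embedding = nn.Embedding(2, config["hidden_size"])
token_embedding = nn.Embedding(30522, config["hidden_size"])  # vocab size assumed

tokens = torch.randint(0, 30522, (1, config["max_seq_len"]))  # dummy batch
style = torch.tensor([0])                                     # 0 = expert, 1 = layman
x = torch.cat([style_embedding(style).unsqueeze(1), token_embedding(tokens)], dim=1)
hidden_states = encoder(x)

optimizer = torch.optim.Adam(encoder.parameters(), lr=config["learning_rate"])
```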