Discourse Level Factors for Sentence Deletion in Text Simplification

Authors: Yang Zhong, Chao Jiang, Wei Xu, Junyi Jessy Li (pp. 9709-9716)

AAAI 2020

Reproducibility assessment. Each entry below lists the variable, the result, and the LLM's supporting response:
Research Type: Experimental. This paper presents a data-driven study focused on analyzing and predicting sentence deletion, a prevalent but understudied phenomenon in document simplification, on a large English text simplification corpus. We inspect various document and discourse factors associated with sentence deletion, using a new manually annotated sentence alignment corpus we collected. To predict whether a sentence will be deleted during simplification to a certain level, we harness automatically aligned data to train a classification model. Evaluated on our manually annotated data, our best models reached F1 scores of 65.2 and 59.7 for this task at the elementary and middle school levels, respectively.
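
For concreteness, a minimal sketch of how such per-level F1 scores can be computed, assuming the task is scored as binary F1 on the deleted class; the toy labels and variable names below are invented for illustration and are not from the paper:

```python
# Minimal sketch: sentence deletion framed as binary classification
# (deleted vs. kept) per target grade level, scored with F1.
# The labels below are toy values, not data from the paper.
from sklearn.metrics import f1_score

# gold[i] / pred[i] = 1 if sentence i is deleted when simplifying to this level
gold_elementary = [1, 0, 1, 1, 0, 1]
pred_elementary = [1, 0, 0, 1, 0, 1]

# F1 on the positive (deleted) class; the 65.2 / 59.7 figures above are
# the same metric reported as percentages.
print(f1_score(gold_elementary, pred_elementary))
```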
Researcher Affiliation: Academia. Yang Zhong (1), Chao Jiang (1), Wei Xu (1), Junyi Jessy Li (2). (1) Department of Computer Science and Engineering, The Ohio State University; (2) Department of Linguistics, The University of Texas at Austin.
Pseudocode: No. No pseudocode or algorithm blocks are present in the paper.
Open Source Code: No. The paper states: 'To request our data, please first obtain access to the Newsela corpus at: https://newsela.com/data/, then contact the authors.' This refers to data access, not to a release of their source code, and no other statement about releasing the code for their method appears in the paper.
Open Datasets: Yes. 'We use the Newsela text simplification corpus (Xu, Callison-Burch, and Napoles 2015) of 936 news articles.'
Dataset Splits: Yes. 'We use 15 of the manually aligned articles as the validation set and the other 35 articles as test set.'
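
A minimal sketch of that article-level split; the placeholder IDs and fixed shuffle seed are assumptions, since the paper does not say how the 15 validation articles were chosen:

```python
# Sketch: split the 50 manually aligned articles into 15 validation
# and 35 test articles, at the article (not sentence) level.
import random

article_ids = list(range(50))          # placeholder IDs for the 50 articles
random.Random(0).shuffle(article_ids)  # assumed: a fixed-seed shuffle
val_ids, test_ids = article_ids[:15], article_ids[15:]
assert len(val_ids) == 15 and len(test_ids) == 35
```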
Hardware Specification: No. The acknowledgments section states: 'We thank NVIDIA and Texas Advanced Computing Center at UT Austin for providing GPU computing resources', but it does not specify exact GPU models, CPU types, or other hardware details.
Software Dependencies: No. The paper mentions software such as PyTorch and Scikit-learn but gives no version numbers for these or any other key components, which a reproducible description requires.
Experiment Setup: Yes. 'We use Adam (Kingma and Ba 2015) for optimization and also apply a dropout of 0.5 to prevent overfitting. We set the learning rate to 1e-5 and 2e-5 for experiments in Tables 9 and 10 respectively. We set the batch size to 64. We followed (Maddela and Xu 2018) and set the number of bins k to 10 and the adjustable fraction γ to 0.2 for the Gaussian feature vectorization layer.'
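
A hedged PyTorch sketch of this configuration: only the stated hyperparameters (Adam, dropout 0.5, learning rate 1e-5 or 2e-5, batch size 64, k = 10 bins, γ = 0.2) come from the paper; the module shapes, the exact form of the Gaussian layer, and all names below are assumptions.

```python
# Hedged sketch of the quoted training setup, not the authors' code.
import torch
import torch.nn as nn

class GaussianVectorizer(nn.Module):
    """One plausible reading of the Gaussian feature vectorization layer of
    Maddela and Xu (2018): project a scalar feature onto k Gaussian bins
    spanning its value range, with widths set by the fraction gamma."""
    def __init__(self, lo: float, hi: float, k: int = 10, gamma: float = 0.2):
        super().__init__()
        self.register_buffer("centers", torch.linspace(lo, hi, k))  # k = 10 bins
        self.sigma = gamma * (hi - lo)                               # gamma = 0.2

    def forward(self, x):
        # x: (batch, 1) scalar feature -> (batch, k) soft bin activations
        return torch.exp(-((x - self.centers) ** 2) / (2 * self.sigma ** 2))

class DeletionClassifier(nn.Module):
    """Placeholder classifier head; the paper's actual architecture differs."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.dropout = nn.Dropout(p=0.5)   # dropout of 0.5, per the paper
        self.out = nn.Linear(in_dim, 2)    # kept vs. deleted

    def forward(self, feats):
        return self.out(self.dropout(feats))

model = DeletionClassifier(in_dim=10)      # feature size is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # 2e-5 for Table 10
BATCH_SIZE = 64                            # batch size of 64, per the paper
```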