Structure Learning for Headline Generation

Authors: Ruqing Zhang, Jiafeng Guo, Yixing Fan, Yanyan Lan, Xueqi Cheng

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies show that our model can significantly outperform the state-of-the-art headline generation models.
Researcher Affiliation | Academia | CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a repository link or an explicit statement about releasing the source code for its methodology. It mentions using TensorFlow but does not provide its own implementation code.
Open Datasets | Yes | We evaluate our model on a public benchmark collection, i.e., the New York Times (NYT) Annotated corpus. The corpus contains over 1.8 million documents written and published by the New York Times between January 1, 1987 and June 19, 2007.
Dataset Splits | Yes | We randomly sample 2000 pairs to form the development and test sets respectively, and the remaining pairs are used as the training data. (A minimal split sketch appears after this table.)
Hardware Specification | Yes | We run our model on a Tesla K80 GPU card.
Software Dependencies | No | The paper states 'We implement our model in TensorFlow' but does not specify a version number for TensorFlow or any other software dependencies with their versions.
Experiment Setup | Yes | The dimension of word embeddings is 300, while the dimension of position embeddings is 200. We use one layer of bi-directional GRU for the word encoder and another uni-directional GRU for the decoder. We use three GCN hidden layers. The hidden unit size in the word encoder, word decoder and GCN is 300. The pooling parameter k is set to 12. The learning rate of Adam (Kingma and Ba 2015) is set to 0.0005. All trainable parameters are initialized in the range [-0.1, 0.1]. For training, we use a mini-batch size of 64, and documents with similar length (in terms of the number of sentences) are organized into a batch. Dropout with probability 0.2 is applied between vertical GRU stacks, and gradient clipping is adopted by scaling gradients when the norm exceeds a threshold of 5. (A hedged configuration sketch based on these values appears after this table.)
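
The split described in the Dataset Splits row is simple to reproduce. Below is a minimal sketch, assuming the (document, headline) pairs are already loaded in memory; the function name, the fixed seed, and the in-memory representation are assumptions, since the paper does not specify them.

# Hypothetical helper reproducing the reported split: 2000 development pairs,
# 2000 test pairs, and the remaining pairs kept for training.
import random

def split_nyt_pairs(pairs, dev_size=2000, test_size=2000, seed=0):
    """Randomly split (document, headline) pairs into train/dev/test lists."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # the seed is an assumption; none is reported
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test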
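
The hyperparameters listed under Experiment Setup translate directly into a training configuration. The sketch below is not the authors' (unreleased) implementation; it is a hedged TensorFlow sketch that collects the reported values and builds an Adam optimizer with the stated learning rate and gradient-norm clipping. The GCN layers and k-max pooling are assumed custom components and are not shown.

# Configuration sketch based solely on the values quoted above.
import tensorflow as tf

CONFIG = {
    "word_embedding_dim": 300,
    "position_embedding_dim": 200,
    "encoder": "one-layer bidirectional GRU",   # word encoder
    "decoder": "one-layer unidirectional GRU",
    "gcn_hidden_layers": 3,
    "hidden_size": 300,         # word encoder, decoder, and GCN hidden units
    "pooling_k": 12,
    "learning_rate": 5e-4,      # Adam (Kingma and Ba 2015)
    "init_range": (-0.1, 0.1),
    "batch_size": 64,           # documents of similar length grouped per batch
    "dropout": 0.2,             # between vertical GRU stacks
    "grad_clip_norm": 5.0,
}

# Uniform initializer for trainable parameters, as reported.
initializer = tf.keras.initializers.RandomUniform(
    minval=CONFIG["init_range"][0], maxval=CONFIG["init_range"][1]
)

# Adam with gradient clipping; global_clipnorm scales gradients when their
# global norm exceeds the threshold (available in TF >= 2.4).
optimizer = tf.keras.optimizers.Adam(
    learning_rate=CONFIG["learning_rate"],
    global_clipnorm=CONFIG["grad_clip_norm"],
)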