Distraction-Based Neural Networks for Modeling Documents

Authors: Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Without engineering any features, we train the models on two large datasets. The models achieve the state-of-the-art performance, and they significantly benefit from the distraction modeling, particularly when input documents are long.
Researcher Affiliation | Collaboration | 1 University of Science and Technology of China, Hefei, China; 2 York University, Canada; 3 iFLYTEK Research, Hefei, China
Pseudocode | Yes | Algorithm 1: Beam search with distraction (see the illustrative sketch below the table)
Open Source Code | Yes | We make our code publicly available. Our implementation uses Python and is based on the Theano library [Bergstra et al., 2010]. (Footnote 2: Our code is available at https://github.com/lukecq1231/nats)
Open Datasets | Yes | We experiment with our summarization models on two publicly available corpora with different document lengths and in different languages: a CNN news collection [Hermann et al., 2015] and a Chinese corpus made available more recently in [Hu et al., 2015].
Dataset Splits | Yes | We used the original training/testing split mentioned in [Hu et al., 2015], but additionally randomly sampled a small part of the training data as our validation set. Table 1 (excerpt, # Doc. row): CNN: Train 81,824, Valid 1,184, Test 1,093; LCSTS: Train 2,400,000, Valid 591, Test 725. (See the hold-out split sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions Python and the Theano library but does not specify their versions or list other software dependencies with version numbers.
Experiment Setup | Yes | We used mini-batch stochastic gradient descent (SGD) to optimize log-likelihood, and Adadelta [Zeiler, 2012] to automatically adapt the learning rate of parameters (ε = 10^-6 and ρ = 0.95). For the CNN dataset, training was performed with shuffled mini-batches of size 64... We limit our vocabulary to include the top 25,000 most frequent words... we set embedding dimension to be 120, the vector length in hidden layers to be 500 for uni-GRU and 600 for bi-GRU. An end-of-sentence token was inserted between every sentence, and an end-of-document token was added at the end. The beam size of decoder was set to be 5. (See the hyperparameter summary below the table.)
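
The Pseudocode row cites the paper's Algorithm 1, beam search with distraction. The exact formulation is not reproduced here; what follows is a minimal illustrative sketch, assuming the distraction term is a penalty on the similarity between the attention vector used at the current decoding step and attention used at earlier steps. The helper names (decoder_step, cosine, LAMBDA, MAX_LEN) and the penalty form are placeholders, not the authors' code.

```python
# Illustrative sketch only: a generic beam search whose candidate score is the
# running log-probability minus a "distraction" penalty that discourages the
# decoder from re-attending to content it has already used.
import numpy as np

BEAM_SIZE = 5   # beam width reported in the paper
MAX_LEN = 120   # assumed maximum output length
LAMBDA = 0.1    # assumed weight of the distraction penalty

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def beam_search_with_distraction(decoder_step, init_state, eos_id):
    # Each hypothesis: (tokens, decoder state, past attention vectors, score).
    beams = [([], init_state, [], 0.0)]
    finished = []
    for _ in range(MAX_LEN):
        candidates = []
        for tokens, state, history, score in beams:
            # decoder_step is assumed to return log-probabilities over the
            # vocabulary, the next decoder state, and the attention vector used.
            log_probs, new_state, attention = decoder_step(tokens, state)
            # Distraction penalty: similarity to previously used attention.
            penalty = max((cosine(attention, h) for h in history), default=0.0)
            for tok in np.argsort(log_probs)[-BEAM_SIZE:]:
                new_score = score + log_probs[tok] - LAMBDA * penalty
                candidates.append((tokens + [int(tok)], new_state,
                                   history + [attention], new_score))
        # Keep the best BEAM_SIZE partial hypotheses; move finished ones aside.
        candidates.sort(key=lambda c: c[-1], reverse=True)
        beams = []
        for cand in candidates[:BEAM_SIZE]:
            (finished if cand[0][-1] == eos_id else beams).append(cand)
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[-1])[0]
```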
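
The Dataset Splits row notes that the validation set was obtained by randomly sampling a small part of the official training data. A minimal sketch of that kind of hold-out split, assuming in-memory lists of documents; the function name, seed, and the example size are illustrative, not taken from the authors' code.

```python
# Carve a validation set out of an existing training split by random sampling.
import random

def split_validation(train_examples, valid_size, seed=1234):
    rng = random.Random(seed)           # fixed seed for a reproducible split
    indices = list(range(len(train_examples)))
    rng.shuffle(indices)
    valid = [train_examples[i] for i in indices[:valid_size]]
    train = [train_examples[i] for i in indices[valid_size:]]
    return train, valid

# e.g. hold out 591 documents, the LCSTS validation size listed in Table 1:
# train_docs, valid_docs = split_validation(lcsts_train_docs, valid_size=591)
```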
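
The Experiment Setup row lists the reported training hyperparameters; below they are collected into a single Python config dict for readability. The key names and the two special-token symbols are assumptions; the numeric values are the ones quoted from the paper.

```python
# Reported hyperparameters gathered into one place (key names are illustrative).
CONFIG = {
    "optimizer": "adadelta",       # mini-batch SGD with Adadelta-adapted rates
    "adadelta_epsilon": 1e-6,      # ε = 10^-6
    "adadelta_rho": 0.95,          # ρ = 0.95
    "batch_size": 64,              # shuffled mini-batches (CNN dataset)
    "vocab_size": 25000,           # top-25,000 most frequent words
    "embedding_dim": 120,
    "hidden_dim_uni_gru": 500,
    "hidden_dim_bi_gru": 600,
    "eos_token": "</s>",           # inserted between sentences (symbol assumed)
    "eod_token": "</d>",           # appended at document end (symbol assumed)
    "beam_size": 5,
}
```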