Distraction-Based Neural Networks for Modeling Documents
Authors: Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Without engineering any features, we train the models on two large datasets. The models achieve the state-of-the-art performance, and they significantly benefit from the distraction modeling, particularly when input documents are long. |
| Researcher Affiliation | Collaboration | ¹University of Science and Technology of China, Hefei, China; ²York University, Canada; ³iFLYTEK Research, Hefei, China |
| Pseudocode | Yes | Algorithm 1 Beam search with distraction |
| Open Source Code | Yes | We make our code publicly available2. Our implementation uses python and is based on the Theano library [Bergstra et al., 2010]. Footnote 2: Our code is available at https://github.com/lukecq1231/nats |
| Open Datasets | Yes | We experiment with our summarization models on two publicly available corpora with different document lengths and in different languages: a CNN news collection [Hermann et al., 2015] and a Chinese corpus made available more recently in [Hu et al., 2015]. |
| Dataset Splits | Yes | We used the original training/testing split mentioned in [Hu et al., 2015], but additionally randomly sampled a small part of the training data as our validation set. Table 1 (excerpt, # Doc. row): CNN — Train 81,824, Valid 1,184, Test 1,093; LCSTS — Train 2,400,000, Valid 591, Test 725 |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'python' and 'Theano library' but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | We used mini-batch stochastic gradient descent (SGD) to optimize log-likelihood, and Adadelta [Zeiler, 2012] to automatically adapt the learning rate of parameters (ε = 10⁻⁶ and ρ = 0.95). For the CNN dataset, training was performed with shuffled mini-batches of size 64... We limit our vocabulary to include the top 25,000 most frequent words... we set embedding dimension to be 120, the vector length in hidden layers to be 500 for uni-GRU and 600 for bi-GRU. An end-of-sentence token was inserted between every sentence, and an end-of-document token was added at the end. The beam size of decoder was set to be 5. |
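
The Pseudocode row cites "Algorithm 1 Beam search with distraction"; the algorithm itself is not reproduced in this report. The following is a minimal sketch of how such a re-scoring step could look, assuming (as an illustration, not the paper's exact formulation) that the distraction term penalizes candidates whose current attention vector is cosine-similar to attention vectors from earlier decoding steps; `lam` is a hypothetical penalty weight.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two attention vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rescore_with_distraction(log_prob, attn, attn_history, lam=0.1):
    # Penalize a beam candidate whose current attention vector `attn`
    # overlaps with attention vectors used at earlier decoding steps.
    # `lam` is an illustrative weight, not a value taken from the paper.
    if not attn_history:
        return log_prob
    max_sim = max(cosine(attn, prev) for prev in attn_history)
    return log_prob - lam * max_sim
```

In a full decoder, each hypothesis on the beam (size 5 in the paper) would keep its own `attn_history`, and candidates would be ranked by the penalized score rather than the raw log-probability.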
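
The Experiment Setup row quotes the Adadelta hyperparameters ρ = 0.95 and ε = 10⁻⁶. For reference, here is a minimal NumPy sketch of the standard Adadelta update of Zeiler [2012] with those values; the function and variable names are illustrative and not taken from the authors' released code.

```python
import numpy as np

def adadelta_step(param, grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    # One Adadelta update (Zeiler, 2012); all arrays share the parameter's shape.
    acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2          # running average of squared gradients
    delta = -np.sqrt(acc_delta + eps) / np.sqrt(acc_grad + eps) * grad
    acc_delta = rho * acc_delta + (1.0 - rho) * delta ** 2       # running average of squared updates
    return param + delta, acc_grad, acc_delta
```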