Global-aware Beam Search for Neural Abstractive Summarization
Authors: Ye Ma, Zixun Lan, Lu Zong, Kaizhu Huang
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on nine datasets show that the global (attention)-aware inference significantly improves state-of-the-art summarization models even using empirical hyper-parameters. The algorithm is also proven robust as it remains to generate meaningful texts with corrupted attention distributions. |
| Researcher Affiliation | Academia | Ye Ma (1,4), Zixun Lan (2,4), Lu Zong (1,4), Kaizhu Huang (3); (1) Department of Financial and Actuarial Mathematics, School of Science; (2) Department of Applied Mathematics, School of Science; (3) Department of Intelligent Science, School of Advanced Technology; (4) Laboratory for Intelligent Computing and Financial Technology; Xi'an Jiaotong-Liverpool University, SIP, 215123 Suzhou, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such. |
| Open Source Code | Yes | The codes and a comprehensive set of examples are available (footnote 2: https://github.com/yema2018/global_aware). |
| Open Datasets | Yes | We evaluate the performance on a total of 9 summarization datasets: 2 datasets (CNN/DM [10], XSUM [5]) with BART [17] and 8 datasets (XSUM [5], BillSum [15], Multi-News [6], Newsroom [9], WikiHow [16], Reddit TIFU [34], arXiv and PubMed [2]) with PEGASUS [39]. (A hedged dataset-loading sketch appears below the table.) |
| Dataset Splits | No | The attention-prediction model is trained on the training set for about 50,000 steps, and checkpoints are saved per 10,000 steps to select the best checkpoints on the development set. ... According to the numerical tests on the development set, we finally choose β = 12, γ = 1.5 for CNN/DM and β = 4, γ = 0 for XSUM. |
| Hardware Specification | Yes | All experiments are conducted on 3 RTX 6000. |
| Software Dependencies | No | The paper mentions software like "Adabelief-optimizer", "Hugging Face transformers", and "ROUGE code" but does not specify their version numbers. For example, it says "The optimizer is the Adabelief-optimizer [41] with eps 1e-16, betas (0.9, 0.999), weight decay 1e-4 and learning rate 2e-5.", which gives hyper-parameters, not version numbers. (See the training-setup sketch below the table.) |
| Experiment Setup | Yes | We adopt a randomly initialized 2-layer transformer-encoder in the attention-prediction model where the structure of each layer is the same as the BART-encoder layer. The optimizer is the Adabelief-optimizer [41] with eps 1e-16, betas (0.9, 0.999), weight decay 1e-4 and learning rate 2e-5. The attention-prediction model is trained on the training set for about 50,000 steps, and checkpoints are saved per 10,000 steps to select the best checkpoints on the development set. ... The searching scopes of β and γ are in {2, 4, 6, 10, 12, 15, 18, 20} and {0, 0.5, 1, 1.5, 2}, respectively. According to the numerical tests on the development set, we finally choose β = 12, γ = 1.5 for CNN/DM and β = 4, γ = 0 for XSUM. As limited improvement could be observed from larger values of γ, we recommend γ = 1 for normal or longer targets. When testing the global-aware inference with PEGASUS [39], we directly use empirical hyper-parameters for each dataset, namely β = 4, γ = 0 for XSUM and β = 12, γ = 1 for the other 7 datasets. The beam size K follows the setups in BART [17] and PEGASUS [39]. (Hedged sketches of the training setup and the β/γ grid search appear below the table.) |
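
The datasets row above names publicly released benchmarks. As a minimal sketch, and purely as an assumption about convenient public mirrors (the paper does not say how the authors obtained their copies), two of the nine corpora can be pulled from the Hugging Face `datasets` hub as follows:

```python
# Hedged sketch: loading two of the reported benchmarks from the
# Hugging Face hub. The identifiers/configs below are assumptions about
# public mirrors, not a statement of how the authors built their splits.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")  # article / highlights fields
xsum = load_dataset("xsum")                      # document / summary fields

print(cnn_dm["train"][0]["article"][:200])
print(xsum["validation"][0]["document"][:200])
```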
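
To make the reported training configuration concrete, here is a minimal sketch that stacks a randomly initialized 2-layer transformer encoder and attaches the AdaBelief optimizer with the stated hyper-parameters. The layer dimensions (d_model 1024, 16 heads, feed-forward 4096, GELU) are an assumption taken from the BART-large encoder layer, since the paper only states that each layer's structure matches the BART-encoder layer; the authors' implementation may differ.

```python
# Minimal training-setup sketch (assumptions noted above): a 2-layer
# transformer encoder approximating the attention-prediction model,
# plus the AdaBelief optimizer with the hyper-parameters reported in
# the paper (eps 1e-16, betas (0.9, 0.999), weight decay 1e-4, lr 2e-5).
import torch.nn as nn
from adabelief_pytorch import AdaBelief  # pip install adabelief-pytorch

D_MODEL, N_HEADS, FFN_DIM, N_LAYERS = 1024, 16, 4096, 2  # assumed BART-large dims

encoder_layer = nn.TransformerEncoderLayer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    dim_feedforward=FFN_DIM,
    activation="gelu",   # BART uses GELU activations
    batch_first=True,
)
attention_predictor = nn.TransformerEncoder(encoder_layer, num_layers=N_LAYERS)

optimizer = AdaBelief(
    attention_predictor.parameters(),
    lr=2e-5,
    eps=1e-16,
    betas=(0.9, 0.999),
    weight_decay=1e-4,
)
```

A training loop around this optimizer would run for roughly 50,000 steps and save a checkpoint every 10,000 steps, with the best checkpoint chosen on the development set, as the row above describes.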
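
The β/γ selection described in the setup row amounts to a grid search on the development set. The sketch below assumes a hypothetical `rouge_on_dev(beta, gamma)` callable that decodes the dev set with global-aware beam search at the given values and returns a ROUGE score; it is not part of any stated API and only illustrates the selection procedure.

```python
# Hedged sketch of the dev-set grid search over beta and gamma.
# `rouge_on_dev` is a hypothetical, user-supplied helper.
from itertools import product

BETA_GRID = [2, 4, 6, 10, 12, 15, 18, 20]   # searching scope of beta
GAMMA_GRID = [0, 0.5, 1, 1.5, 2]            # searching scope of gamma

def select_hyperparameters(rouge_on_dev):
    """Return the (beta, gamma) pair with the best dev-set ROUGE."""
    best_beta, best_gamma, best_score = None, None, float("-inf")
    for beta, gamma in product(BETA_GRID, GAMMA_GRID):
        score = rouge_on_dev(beta=beta, gamma=gamma)
        if score > best_score:
            best_beta, best_gamma, best_score = beta, gamma, score
    return best_beta, best_gamma, best_score
```

Under this procedure the paper reports β = 12, γ = 1.5 for CNN/DM and β = 4, γ = 0 for XSUM; for PEGASUS the empirical values β = 4, γ = 0 (XSUM) and β = 12, γ = 1 (the other 7 datasets) are used directly without searching.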