Global-aware Beam Search for Neural Abstractive Summarization

Authors: Ye Ma, Zixun Lan, Lu Zong, Kaizhu Huang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on a total of nine datasets show that the global (attention)-aware inference significantly improves state-of-the-art summarization models even when using empirical hyper-parameters. The algorithm is also shown to be robust, as it continues to generate meaningful texts with corrupted attention distributions.
Researcher Affiliation | Academia | Ye Ma (1,4), Zixun Lan (2,4), Lu Zong (1,4), Kaizhu Huang (3). 1 Department of Financial and Actuarial Mathematics, School of Science; 2 Department of Applied Mathematics, School of Science; 3 Department of Intelligent Science, School of Advanced Technology; 4 Laboratory for Intelligent Computing and Financial Technology. Xi'an Jiaotong-Liverpool University, SIP, 215123 Suzhou, China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | The code and a comprehensive set of examples are available at https://github.com/yema2018/global_aware.
Open Datasets | Yes | We evaluate the performance on a total of 9 summarization datasets: 2 datasets (CNN/DM [10], XSUM [5]) with BART [17] and 8 datasets (XSUM [5], BillSum [15], Multi-News [6], Newsroom [9], WikiHow [16], Reddit TIFU [34], arXiv and PubMed [2]) with PEGASUS [39].
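All of the listed corpora are publicly distributed. As a minimal sketch, two of them can be loaded with the Hugging Face `datasets` library; the hub identifiers ("cnn_dailymail", "xsum") are assumptions on our part, since the paper does not state how the data were obtained.

```python
# Sketch: loading two of the evaluation corpora via the Hugging Face
# `datasets` library. Dataset identifiers are assumed, not taken from the paper.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")  # train / validation / test splits
xsum = load_dataset("xsum")

print(cnn_dm["train"][0]["article"][:200])   # source document
print(xsum["validation"][0]["summary"])      # reference summary
```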
Dataset Splits | No | The attention-prediction model is trained on the training set for about 50,000 steps, and checkpoints are saved every 10,000 steps to select the best checkpoint on the development set. ... According to the numerical tests on the development set, we finally choose β = 12, γ = 1.5 for CNN/DM and β = 4, γ = 0 for XSUM.
Hardware Specification | Yes | All experiments are conducted on 3 RTX 6000 GPUs.
Software Dependencies | No | The paper mentions software such as the AdaBelief optimizer, Hugging Face transformers, and the ROUGE scoring code, but does not specify version numbers. For example, it states "The optimizer is the Adabelief-optimizer [41] with eps 1e-16, betas (0.9, 0.999), weight decay 1e-4 and learning rate 2e-5," which are hyper-parameters, not version numbers.
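For reference, the reported optimizer settings can be reproduced with the `adabelief-pytorch` package; this is an assumption, as the paper names neither the package nor its version (which is why the dependency check is marked "No").

```python
# Sketch: instantiating the reported optimizer settings, assuming the
# `adabelief-pytorch` package. The model here is a placeholder, not the
# authors' attention-prediction model.
import torch
from adabelief_pytorch import AdaBelief

model = torch.nn.Linear(1024, 1024)  # placeholder module
optimizer = AdaBelief(
    model.parameters(),
    lr=2e-5,              # learning rate 2e-5
    eps=1e-16,            # eps 1e-16
    betas=(0.9, 0.999),   # betas (0.9, 0.999)
    weight_decay=1e-4,    # weight decay 1e-4
)
```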
Experiment Setup | Yes | We adopt a randomly initialized 2-layer transformer-encoder in the attention-prediction model, where the structure of each layer is the same as the BART-encoder layer. The optimizer is the AdaBelief optimizer [41] with eps 1e-16, betas (0.9, 0.999), weight decay 1e-4 and learning rate 2e-5. The attention-prediction model is trained on the training set for about 50,000 steps, and checkpoints are saved every 10,000 steps to select the best checkpoint on the development set. ... The search scopes of β and γ are {2, 4, 6, 10, 12, 15, 18, 20} and {0, 0.5, 1, 1.5, 2}, respectively. According to the numerical tests on the development set, we finally choose β = 12, γ = 1.5 for CNN/DM and β = 4, γ = 0 for XSUM. As limited improvement could be observed from larger values of γ, we recommend γ = 1 for normal or longer targets. When testing the global-aware inference with PEGASUS [39], we directly use empirical hyper-parameters for each dataset, namely β = 4, γ = 0 for XSUM and β = 12, γ = 1 for the other 7 datasets. The beam size K follows the setups in BART [17] and PEGASUS [39].
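The β/γ selection described above is a plain grid search on the development set. The sketch below illustrates that procedure only; `generate_global_aware` and `rouge_l` are hypothetical placeholders, not the authors' actual API, and the scoring metric is assumed to be a ROUGE variant.

```python
# Hypothetical sketch of the dev-set grid search over the global-aware
# hyper-parameters beta and gamma, using the search scopes reported above.
from itertools import product

BETAS = [2, 4, 6, 10, 12, 15, 18, 20]
GAMMAS = [0, 0.5, 1, 1.5, 2]

def grid_search(dev_articles, dev_references, generate_global_aware, rouge_l):
    """Return the (beta, gamma) pair with the best average dev-set score."""
    best_beta, best_gamma, best_score = None, None, float("-inf")
    for beta, gamma in product(BETAS, GAMMAS):
        summaries = [
            generate_global_aware(article, beta=beta, gamma=gamma, num_beams=4)
            for article in dev_articles
        ]
        score = sum(
            rouge_l(summary, reference)
            for summary, reference in zip(summaries, dev_references)
        ) / len(summaries)
        if score > best_score:
            best_beta, best_gamma, best_score = beta, gamma, score
    return best_beta, best_gamma, best_score
```

Under this procedure the paper reports (β, γ) = (12, 1.5) for CNN/DM and (4, 0) for XSUM; with PEGASUS, the empirical settings are used directly instead of re-running the search.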