Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Multi-Document Summarization with Determinantal Point Process Attention
Authors: Laura Perez-Beltrachini, Mirella Lapata
JAIR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluation shows that our attention mechanism consistently improves summarization and delivers performance comparable with the state-of-the-art on the Multi-News dataset. |
| Researcher Affiliation | Academia | Laura Perez-Beltrachini EMAIL Mirella Lapata EMAIL Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, Scotland |
| Pseudocode | Yes | Algorithm 1 Approximate Computation of XMAP |
| Open Source Code | Yes | Our code and data are available at https://github.com/lauhaide/dppattn |
| Open Datasets | Yes | We evaluate our approach on two large-scale datasets which pose different challenges for abstractive MDS. WikiCatSum (Perez-Beltrachini et al., 2019) is an automatically constructed dataset...while Multi-News (Fabbri, Li, She, Li, & Radev, 2019) consists of professionally written summaries |
| Dataset Splits | Yes | Table 1: Number of instances in train/validation/test partitions (Pairs), average summary length (Nb.Words) and number of sentences per summary (Nb.Sents), and average lemma/token ratio (LTR) on clusters' content words. Film 51,399/2,958/2,861... Multi-News 44,972/5,622/5,622 |
| Hardware Specification | No | The paper mentions training on 'a single GPU' or '2 GPUs' for different models, but does not specify the exact GPU model, CPU, or other detailed hardware specifications. For example: 'CVS2S were trained on a single GPU with batch size 5.' and 'CTF variants were trained with 2 GPUs'. |
| Software Dependencies | No | The paper mentions using 'code from https://github.com/pytorch/fairseq' and 'OpenNMT base implementations of Copy Transformer'. However, it does not provide specific version numbers for PyTorch, Fairseq, OpenNMT, or any other software libraries, which are needed for reproducibility. |
| Experiment Setup | Yes | All convolutional models used dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) in the encoder and decoder with a rate of 0.2. For the normalization and initialization of the convolutional architectures, we follow Gehring et al. (2017). All CVS2S models were trained with Nesterov's accelerated gradient method, again following Gehring et al. (2017). CVS2S were trained on a single GPU with batch size 5. For transformer-based models, we applied dropout with probability of 0.2 and label smoothing (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016) with smoothing factor 0.1. The optimizer was Adam (Kingma & Ba, 2015) with learning rate of 2, β1 = 0.9, and β2 = 0.998; we also applied learning rate warm-up over the first 8,000 steps (6,000 on Animal and Company datasets), and decay as in Vaswani et al. (2017). CTF variants were trained with 2 GPUs and batch size of 12,288 tokens for WikiCatSum datasets; and 2 GPUs with batch size of 16,384 tokens for Multi-News. Pointer-Generator models were trained with the Adagrad optimizer (Duchi, Hazan, & Singer, 2011) and learning rate of 0.15. PG models were trained for 50,000 epochs and best models were selected based on ROUGE scores on the validation set. PG variants were trained with 2 GPUs and batch size of 40 instances for WikiCatSum datasets; and 4 GPUs with batch size of 40 instances for Multi-News. We decode with a beam of size 5. We normalize the log-likelihood of the candidate hypotheses y by their length, \|y\|^α with α = 0.9 (Wu et al., 2016) for Animal and Multi-News but set α = 0 on the Film and Company datasets. All CVS2S models (Fairseq) use no length normalization. We use trigram blocking (Paulus, Xiong, & Socher, 2018) with all models on the WikiCatSum dataset but no coverage penalty (Gehrmann et al., 2018), as experiments indicated it was hurting performance. Trigram blocking is a hard constraint similar to the coverage penalty: it aims to reduce redundancy in the decoded summary S by skipping a candidate sentence c if a trigram in c overlaps with S. Conversely, we use the coverage penalty (β = 5) on Multi-News but no trigram blocking. To make relevance scores sharper, we experimented with temperature values τ < 1 in Equation (7) (Section 3.2) at inference time. In particular, for +DPP variants this would further gear the attention towards a more incremental reading of the input content elements. Indeed, within the CTF architecture, sparser relevance brought no improvements on the base and +CovLoss variants but increased the performance of +DPP (with best τ = 0.6). |
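The trigram blocking heuristic quoted in the Experiment Setup row (Paulus et al., 2018) can be sketched as follows. This is a minimal illustration, not the authors' released implementation; all function and variable names here are ours.

```python
# Trigram blocking sketch: skip a candidate sentence c if any word trigram
# in c already appears in the partially decoded summary S.

def trigrams(tokens):
    """Return the set of word trigrams in a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def overlaps(candidate_tokens, summary_tokens):
    """True if the candidate shares any trigram with the summary so far."""
    return bool(trigrams(candidate_tokens) & trigrams(summary_tokens))

def extend_summary(summary_sents, candidate_sents):
    """Append candidate sentences, skipping any with a trigram overlap."""
    for cand in candidate_sents:
        flat = [tok for sent in summary_sents for tok in sent]
        if not overlaps(cand, flat):
            summary_sents.append(cand)
    return summary_sents
```

In the paper's setting this check is applied during beam-search decoding rather than to finished sentences, but the blocking condition is the same hard constraint.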
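The temperature sharpening mentioned for Equation (7) amounts to dividing the relevance scores by τ < 1 before the softmax. A minimal sketch under that reading (the function name is illustrative; Equation (7) itself is not reproduced here):

```python
import math

def softmax_with_temperature(scores, tau=1.0):
    """Softmax over raw relevance scores. tau < 1 sharpens the
    distribution, tau > 1 flattens it; tau = 0.6 is the best value
    reported above for the +DPP variant at inference time."""
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Sharper relevance concentrates attention mass on fewer input elements, which is why the report describes it as encouraging a more incremental reading of the input.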