Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multi-Document Summarization with Determinantal Point Process Attention

Authors: Laura Perez-Beltrachini, Mirella Lapata

JAIR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluation shows that our attention mechanism consistently improves summarization and delivers performance comparable with the state-of-the-art on the Multi-News dataset.
Researcher Affiliation | Academia | Laura Perez-Beltrachini EMAIL Mirella Lapata EMAIL Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, Scotland
Pseudocode | Yes | Algorithm 1: Approximate Computation of XMAP
Open Source Code | Yes | Our code and data are available at https://github.com/lauhaide/dppattn
Open Datasets | Yes | We evaluate our approach on two large-scale datasets which pose different challenges for abstractive MDS. WikiCatSum (Perez-Beltrachini et al., 2019) is an automatically constructed dataset...while Multi-News (Fabbri, Li, She, Li, & Radev, 2019) consists of professionally written summaries
Dataset Splits | Yes | Table 1: Number of instances in train/validation/test partitions (Pairs), average summary length (Nb.Words) and number of sentences per summary (Nb.Sents), and average lemma/token ratio (LTR) on clusters' content words. Film 51,399/2,958/2,861... Multi-News 44,972/5,622/5,622
Hardware Specification | No | The paper mentions training on 'a single GPU' or '2 GPUs' for different models, but does not specify the exact GPU model, CPU, or other detailed hardware specifications. For example: 'CVS2S were trained on a single GPU with batch size 5.' and 'CTF variants were trained with 2 GPUs'.
Software Dependencies | No | The paper mentions using 'code from https://github.com/pytorch/fairseq' and 'OpenNMT base implementations of Copy Transformer'. However, it does not provide specific version numbers for PyTorch, Fairseq, OpenNMT, or any other software library, which is required for reproducibility.
Experiment Setup | Yes | All convolutional models used dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) in the encoder and decoder with a rate of 0.2. For the normalization and initialization of the convolutional architectures, we follow Gehring et al. (2017). All CVS2S models were trained with Nesterov's accelerated gradient method, again following Gehring et al. (2017). CVS2S models were trained on a single GPU with batch size 5. For transformer-based models, we applied dropout with probability of 0.2 and label smoothing (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016) with smoothing factor 0.1. The optimizer was Adam (Kingma & Ba, 2015) with learning rate of 2, β1 = 0.9, and β2 = 0.998; we also applied learning rate warm-up over the first 8,000 steps (6,000 on the Animal and Company datasets), and decay as in Vaswani et al. (2017). CTF variants were trained with 2 GPUs and a batch size of 12,288 tokens for the WikiCatSum datasets, and with 2 GPUs and a batch size of 16,384 tokens for Multi-News. Pointer-Generator models were trained with the Adagrad optimizer (Duchi, Hazan, & Singer, 2011) and a learning rate of 0.15. PG models were trained for 50,000 epochs and the best models were selected based on ROUGE scores on the validation set. PG variants were trained with 2 GPUs and a batch size of 40 instances for the WikiCatSum datasets, and with 4 GPUs and a batch size of 40 instances for Multi-News. We decode with a beam of size 5. We normalize the log-likelihood of candidate hypotheses y by their length, |y|^α with α = 0.9 (Wu et al., 2016), for Animal and Multi-News, but set α = 0 on the Film and Company datasets. All CVS2S models (Fairseq) use no length normalization.
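The warm-up-then-decay schedule quoted above matches the inverse-square-root schedule of Vaswani et al. (2017), with the reported 'learning rate of 2' acting as a scale factor. A minimal sketch, assuming a model dimension of 512 (not stated in the quote; `noam_lr` is a hypothetical helper name):

```python
def noam_lr(step, d_model=512, warmup=8000, factor=2.0):
    """Inverse-square-root learning-rate schedule (Vaswani et al., 2017):
    linear warm-up for `warmup` steps, then decay proportional to step^-0.5.
    `factor` plays the role of the quoted 'learning rate of 2';
    d_model=512 is an assumption for illustration."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The rate rises linearly until step 8,000 (6,000 for Animal and Company, per the quote) and decays thereafter; the peak value occurs exactly at the warm-up boundary.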
We use trigram blocking (Paulus, Xiong, & Socher, 2018) with all models on the WikiCatSum dataset but no coverage penalty (Gehrmann et al., 2018), as experiments indicated it was hurting performance. Trigram blocking is a hard constraint similar to the coverage penalty: it aims to reduce redundancy in the decoded summary S by skipping a candidate sentence c if a trigram of c already appears in S. Conversely, we use the coverage penalty (β = 5) on Multi-News but no trigram blocking. To make relevance scores sharper, we experimented with temperature values τ < 1 in Equation (7) (Section 3.2) at inference time. In particular, for +DPP variants this would further gear the attention towards a more incremental reading of the input content elements. Indeed, within the CTF architecture, sparser relevance brought no improvements on the base and +CovLoss variants but increased the performance of +DPP (with best τ = 0.6).
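The trigram-blocking constraint described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: whitespace tokenisation and the helper names are assumptions.

```python
def _trigrams(tokens):
    """All contiguous token triples of a sentence."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def trigram_block(summary_sentences, candidate):
    """Trigram blocking (Paulus et al., 2018): return True (skip the
    candidate) if any trigram of `candidate` already occurs in the
    summary decoded so far. Tokenisation by str.split() is an
    assumption for illustration."""
    seen = set()
    for sent in summary_sentences:
        seen |= _trigrams(sent.split())
    return bool(seen & _trigrams(candidate.split()))
```

A candidate repeating a trigram of the partial summary is blocked, while a novel sentence passes; unlike the soft coverage penalty, this is an all-or-nothing filter applied during beam search.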