Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents
Authors: Mengna Zhu, Kaisheng Zeng, Mao Wang, Kaiming Xiao, Lei Hou, Hongbin Huang, Juanzi Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) The event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) The recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information. |
| Researcher Affiliation | Academia | (1) Laboratory for Big Data and Decision, National University of Defense Technology; (2) Department of Computer Science and Technology, Tsinghua University; (3) College of Information and Communication, National University of Defense Technology |
| Pseudocode | No | The paper describes methods like 'Automatic Data Construction' and 'Human Annotation' in paragraph form and uses figures to illustrate processes, but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code: https://github.com/Mzzzhu/EventSum |
| Open Datasets | Yes | We developed EventSum, the first large-scale Chinese multi-document summarization dataset, automatically constructed from Baidu Baike entries for this task. ... All data utilized in this work are publicly available and freely accessible, with no inclusion of proprietary or restricted data. |
| Dataset Splits | Yes | Finally, we obtain 5,100 instances and split them into training, validation, and testing sets. In EventSum, each instance corresponds to a dynamic event. ... EventSum (Chinese): 4,015 / 500 / 585 (train/validation/test) |
| Hardware Specification | No | The paper mentions evaluating various LLMs and NLI models (e.g., glm-4-9b, chinese-roberta-wwm-ext) but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train these models. |
| Software Dependencies | No | The paper mentions using specific models like 'paraphrase-multilingual-mpnet-base-v2' from 'sentence-transformers' and 'glm-4-9b' for LLMs, and 'chinese-roberta-wwm-ext' for NLI. However, it does not provide specific version numbers for the 'sentence-transformers' library itself or other software dependencies like Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | No | The paper states that the assessment was conducted under the "zero-shot setting" and lists the LLMs evaluated. However, it does not provide specific hyperparameters (e.g., learning rate, batch size, number of epochs) or other system-level training settings used for their experiments, as it primarily evaluates existing LLMs rather than training new ones from scratch or fine-tuning with specific parameters. |
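The reported split sizes can be sanity-checked directly. The sketch below is illustrative only: the counts (4,015 / 500 / 585) come from the table above, while the `splits` dictionary and the derived ratios are assumptions for demonstration, not a released configuration.

```python
# Illustrative check of the EventSum split sizes reported above:
# 4,015 train / 500 validation / 585 test, totaling 5,100 instances.
splits = {"train": 4015, "validation": 500, "test": 585}

total = sum(splits.values())
assert total == 5100, f"expected 5,100 instances, got {total}"

# Print each split with its share of the full dataset.
for name, count in splits.items():
    print(f"{name}: {count} ({count / total:.1%})")
```

Running this confirms the counts are internally consistent with the stated total of 5,100 instances.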