Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents
Authors: Mengna Zhu, Kaisheng Zeng, Mao Wang, Kaiming Xiao, Lei Hou, Hongbin Huang, Juanzi Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) The event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) The recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information. |
| Researcher Affiliation | Academia | (1) Laboratory for Big Data and Decision, National University of Defense Technology; (2) Department of Computer Science and Technology, Tsinghua University; (3) College of Information and Communication, National University of Defense Technology |
| Pseudocode | No | The paper describes methods like 'Automatic Data Construction' and 'Human Annotation' in paragraph form and uses figures to illustrate processes, but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Code: https://github.com/Mzzzhu/EventSum |
| Open Datasets | Yes | We developed EventSum, the first large-scale Chinese multi-document summarization dataset, automatically constructed from Baidu Baike entries for this task. ... All data utilized in this work are publicly available and freely accessible, with no inclusion of proprietary or restricted data. |
| Dataset Splits | Yes | Finally, we obtain 5,100 instances and split them into training, validation, and testing sets. In EventSum, each instance corresponds to a dynamic event. ... EventSum (Chinese): 4,015 / 500 / 585 (train/validation/test) |
| Hardware Specification | No | The paper mentions evaluating various LLMs and NLI models (e.g., glm-4-9b, chinese-roberta-wwm-ext) but does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments or train these models. |
| Software Dependencies | No | The paper mentions using specific models like 'paraphrase-multilingual-mpnet-base-v2' from 'sentence-transformers' and 'glm-4-9b' for LLMs, and 'chinese-roberta-wwm-ext' for NLI. However, it does not provide specific version numbers for the 'sentence-transformers' library itself or other software dependencies like Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | No | The paper states that the assessment was conducted under the "zero-shot setting" and lists the LLMs evaluated. However, it does not provide specific hyperparameters (e.g., learning rate, batch size, number of epochs) or other system-level training settings used for their experiments, as it primarily evaluates existing LLMs rather than training new ones from scratch or fine-tuning with specific parameters. |
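The reported split sizes can be sanity-checked directly. The sketch below is illustrative only: the counts (4,015 / 500 / 585) come from the table above, while the `splits` dictionary and the derived ratios are assumptions for demonstration, not a released configuration.

```python
# Illustrative check of the EventSum split sizes reported above:
# 4,015 train / 500 validation / 585 test, totaling 5,100 instances.
splits = {"train": 4015, "validation": 500, "test": 585}

total = sum(splits.values())
assert total == 5100, f"expected 5,100 instances, got {total}"

# Print each split with its share of the full dataset.
for name, count in splits.items():
    print(f"{name}: {count} ({count / total:.1%})")
```

Running this confirms the counts are internally consistent with the stated total of 5,100 instances.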