TGSum: Build Tweet Guided Multi-Document Summarization Dataset

Authors: Ziqiang Cao, Chengyao Chen, Wenjie Li, Sujian Li, Furu Wei, Ming Zhou

AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Both informativeness and readability of the collected summaries are verified by manual judgment. In addition, we train a Support Vector Regression summarizer on DUC generic multi-document summarization benchmarks. With the collected data as extra training resource, the performance of the summarizer improves a lot on all the test sets." |
| Researcher Affiliation | Collaboration | (1) Department of Computing, The Hong Kong Polytechnic University, Hong Kong; (2) Key Laboratory of Computational Linguistics, Peking University, MOE, China; (3) Microsoft Research, Beijing, China |
| Pseudocode | No | The paper provides mathematical formulations and constraints for its ILP solution but includes no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states, "We release this dataset for further research", with a footnote pointing to "http://www4.comp.polyu.edu.hk/ cszqcao/". This refers explicitly to the *dataset* being released, not to source code for the methodology or implementation. |
| Open Datasets | Yes | "For instance, the generic multi-document summarization task aims to summarize a cluster of documents telling the same topic. In this task, the most widely-used datasets are published by Document Understanding Conferences (DUC) in 01, 02 and 04." (footnote: http://duc.nist.gov/) and "We release this dataset for further research." (footnote: http://www4.comp.polyu.edu.hk/ cszqcao/) |
| Dataset Splits | No | The paper uses the DUC datasets as test sets and TGSum as an "extra training resource", but it gives no percentages, sample counts, or explicit train/validation/test splitting methodology for its experiments. |
| Hardware Specification | No | The paper provides no hardware details (GPU/CPU models, processor types, or memory) for running its experiments. |
| Software Dependencies | No | The paper mentions the open Python package newspaper and the IBM CPLEX Optimizer, but it provides no version numbers for these or any other software dependencies needed to replicate the experiments. |
| Experiment Setup | No | The paper lists the features used by the SVR summarizer (TF, LENGTH, STOP-RATIO) and sets the summary length to 100 words, but it reports no SVR hyperparameter values (e.g., kernel or regularization settings) or other detailed training configuration. |
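Since the paper reports an ILP formulation but no pseudocode, a minimal sketch may help readers picture what ILP-based extractive selection looks like. This is not the authors' formulation: the salience scores, sentence lengths, and the simple budget-only constraint below are all hypothetical, chosen only to illustrate the technique (maximize total salience subject to a 100-word length budget) using SciPy's MILP solver.

```python
# Hypothetical ILP sketch (not the paper's exact model): pick a binary
# subset of sentences maximizing salience under a 100-word budget.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

scores = np.array([0.9, 0.6, 0.4, 0.3])   # hypothetical per-sentence salience
lengths = np.array([40, 35, 30, 20])      # hypothetical sentence lengths (words)
budget = 100                              # summary length used in the paper

# maximize scores @ x  ==  minimize -scores @ x, with x binary
res = milp(
    c=-scores,
    constraints=LinearConstraint([lengths], ub=budget),
    integrality=np.ones_like(scores),     # 1 = integer variable
    bounds=Bounds(0, 1),                  # with integrality, x is binary
)
selected = [i for i, v in enumerate(res.x) if v > 0.5]
```

The paper itself solves its ILP with the IBM CPLEX Optimizer; SciPy's open-source solver is used here only to keep the sketch self-contained.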
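To make the under-specified setup concrete, the following sketch shows one plausible reading of the paper's SVR summarizer: score each sentence with the three named surface features (TF, LENGTH, STOP-RATIO) and greedily fill a 100-word summary. Everything beyond those three feature names and the 100-word budget is an assumption — the stopword list, training targets, and all SVR hyperparameters are invented for illustration, since the paper does not report them.

```python
# Illustrative sketch, NOT the authors' code: SVR sentence scoring with
# the three features the paper names (TF, LENGTH, STOP-RATIO).
from collections import Counter
from sklearn.svm import SVR

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "at", "are"}  # toy list

def features(sentence, doc_tf):
    words = sentence.lower().split()
    n = max(len(words), 1)
    tf = sum(doc_tf[w] for w in words) / n              # TF: mean term frequency
    length = len(words)                                  # LENGTH: word count
    stop_ratio = sum(w in STOPWORDS for w in words) / n  # STOP-RATIO
    return [tf, length, stop_ratio]

# Toy cluster with hypothetical salience targets (real targets would come
# from ROUGE scores against reference summaries).
docs = ["the cat sat on the mat", "a dog barked at the cat", "cats and dogs are pets"]
doc_tf = Counter(w for s in docs for w in s.lower().split())
X = [features(s, doc_tf) for s in docs]
y = [0.8, 0.5, 0.3]

model = SVR()  # default hyperparameters; the paper reports none
model.fit(X, y)
scores = model.predict(X)

# Greedy selection under the paper's 100-word summary budget.
budget, total, summary = 100, 0, []
for s in sorted(docs, key=lambda s: -scores[docs.index(s)]):
    if total + len(s.split()) <= budget:
        summary.append(s)
        total += len(s.split())
```

With the SVR settings unreported, any reimplementation would need to tune the kernel and regularization itself, which is precisely the reproducibility gap the table row flags.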