PEAK: Pyramid Evaluation via Automated Knowledge Extraction

Authors: Qian Yang, Rebecca Passonneau, Gerard de Melo

AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that PEAK scores correlate very well with the manual Pyramid method. We even encounter examples where PEAK fares better than humans in assigning weights to SCUs. For the score comparison, we use the raw (non-normalized) scores. Table 1 gives the correlations between scores based on P and scores based on one of the two manual pyramids P1 and P2, which were created by different annotators who worked independently with the same five reference summaries.
Researcher Affiliation | Academia | Qian Yang, IIIS, Tsinghua University, Beijing, China (laraqianyang@gmail.com); Rebecca J. Passonneau, CCLS, Columbia University, New York, NY, USA (becky@ccls.columbia.edu); Gerard de Melo, IIIS, Tsinghua University, Beijing, China (gdm@demelo.org)
Pseudocode | Yes | Algorithm 1 (Merge similar SCUs) and Algorithm 2 (Computing scores for target summaries); hedged sketches of both steps follow the table.
Open Source Code | Yes | A distributable code package is available at http://www.larayang.com/peak/.
Open Datasets | Yes | Student summaries: our experiments focus on a student summary dataset from Perin et al. (2013) with twenty target summaries written by students. Machine-generated summaries: for further validation, we also conducted an additional experiment on data from the 2006 Document Understanding Conference (DUC06), administered by NIST.
Dataset Splits | No | The paper uses five reference model summaries to build a pyramid and then scores twenty student summaries and twenty-two machine-generated summaries, reporting correlations with manually created pyramids. It does not define traditional train/validation/test splits in the usual machine-learning sense of partitioning data for model training, hyperparameter tuning, and final evaluation: the reference summaries are inputs for building the pyramid, and the target summaries are used only for evaluation.
Hardware Specification | No | The paper does not provide any details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions the ClausIE system (Del Corro and Gemulla 2013) for Open IE and Align, Disambiguate and Walk (ADW) (Pilehvar, Jurgens, and Navigli 2013) for semantic similarity. While these are software components, no version numbers for them, or for any other software dependencies, are provided.
Experiment Setup | Yes | A pair of nodes u and v has an edge if and only if their similarity sim(u, v) >= t. We picked the midpoint of 0.5 as the threshold t for two nodes to be more similar than not. So we set d_min to 3, which is slightly bigger than the midpoint of the regular maximum weight, meaning that nodes with degree at least 3 are chosen as salient. In our experiments, T1 is fixed at 0.8 and T is fixed at 0.6.
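
The Experiment Setup row above quotes the graph-based SCU selection step (Algorithm 1). Below is a minimal, hedged sketch of that step under stated assumptions: `clauses` and `sim` are hypothetical placeholders (in the paper, clauses come from ClausIE and similarity from ADW), `networkx` is used only for convenience, and the degree-based weight is a rough stand-in for the paper's SCU weights, not the released PEAK implementation.

```python
# Sketch of similarity-graph construction and salient-SCU selection.
# `clauses` is a list of extracted clause strings; `sim` is an assumed
# similarity function returning a value in [0, 1] (ADW in the paper).
import itertools
import networkx as nx

T_EDGE = 0.5   # threshold t: connect two clause nodes when sim(u, v) >= t
D_MIN = 3      # minimum degree for a node to be treated as a salient SCU

def build_similarity_graph(clauses, sim):
    """Undirected graph over candidate clauses; an edge links any pair
    judged more similar than not, i.e. sim(u, v) >= T_EDGE."""
    g = nx.Graph()
    g.add_nodes_from(range(len(clauses)))
    for u, v in itertools.combinations(range(len(clauses)), 2):
        if sim(clauses[u], clauses[v]) >= T_EDGE:
            g.add_edge(u, v)
    return g

def salient_scus(clauses, sim):
    """Keep nodes whose degree reaches D_MIN and return (clause, weight)
    pairs, using the node degree as an approximate SCU weight."""
    g = build_similarity_graph(clauses, sim)
    return [(clauses[n], g.degree(n)) for n in g.nodes if g.degree(n) >= D_MIN]
```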
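
For the scoring step (Algorithm 2) and the score comparison quoted in the Research Type row, a similarly hedged sketch follows. `pyramid` is a list of (SCU, weight) pairs as produced above, `sim` is again an assumed similarity function, and Pearson correlation is shown only as one standard choice; the paper's Table 1 reports the actual correlations against the manual pyramids.

```python
# Sketch of raw summary scoring and correlation with manual pyramid scores.
from scipy.stats import pearsonr

T_MATCH = 0.6  # threshold T: an SCU counts as expressed if its best similarity
               # to any clause of the target summary reaches T

def score_summary(summary_clauses, pyramid, sim):
    """Raw (non-normalized) score: sum of weights of pyramid SCUs that are
    matched by at least one clause of the target summary."""
    score = 0.0
    for scu, weight in pyramid:
        if any(sim(scu, clause) >= T_MATCH for clause in summary_clauses):
            score += weight
    return score

def correlate_with_manual(peak_scores, manual_scores):
    """Pearson correlation between automatic and manual raw scores over the
    same set of target summaries."""
    r, p_value = pearsonr(peak_scores, manual_scores)
    return r, p_value
```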