PEAK: Pyramid Evaluation via Automated Knowledge Extraction
Authors: Qian Yang, Rebecca Passonneau, Gerard de Melo
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that PEAK scores correlate very well with the manual Pyramid method. We even encounter examples where PEAK fares better than humans in assigning weights to SCUs. For the score comparison, we use the raw (non-normalized) scores. Table 1 gives the correlations between scores based on P and scores based on one of the two manual pyramids P1 and P2, which were created by different annotators who worked independently with the same five reference summaries. (A hedged correlation sketch follows this table.) |
| Researcher Affiliation | Academia | Qian Yang, IIIS, Tsinghua University, Beijing, China (laraqianyang@gmail.com); Rebecca J. Passonneau, CCLS, Columbia University, New York, NY, USA (becky@ccls.columbia.edu); Gerard de Melo, IIIS, Tsinghua University, Beijing, China (gdm@demelo.org) |
| Pseudocode | Yes | Algorithm 1 (Merge similar SCUs) and Algorithm 2 (Computing scores for target summaries). Hedged sketches of both steps follow this table. |
| Open Source Code | Yes | A distributable code package is available at http://www.larayang.com/peak/. |
| Open Datasets | Yes | Student Summaries: Our experiments focus on a student summary dataset from Perin et al. (2013) with twenty target summaries written by students. Machine-Generated Summaries: For further validation, we also conducted an additional experiment on data from the 2006 Document Understanding Conference (DUC06) administered by NIST. |
| Dataset Splits | No | The paper mentions using five reference model summaries to generate a pyramid and then scoring twenty student summaries and twenty-two machine-generated summaries. It also discusses correlations with manually created pyramids. However, it does not explicitly define traditional train/validation/test splits for the data used *by their model* (PEAK) in the typical machine learning sense of partitioning a dataset into these subsets for model training, hyperparameter tuning, and final evaluation. The 'reference summaries' are inputs to build the pyramid, and 'target summaries' are used for evaluation. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Claus IE system (Del Corro and Gemulla 2013)' for Open IE and 'Align, Disambiguate and Walk (ADW) (Pilehvar, Jurgens, and Navigli 2013)' for semantic similarity. While these are software components, no specific version numbers for these, or any other software dependencies, are provided. |
| Experiment Setup | Yes | A pair of nodes u and v will have an edge if and only if their similarity sim(u, v) ≥ t. We picked the midpoint of 0.5 as the threshold t for two nodes to be more similar than not. We set d_min to 3, which is slightly bigger than the midpoint of the regular maximum weight, meaning that nodes with degree at least 3 are chosen as salient. In our experiments, T1 is fixed at 0.8 and T is fixed at 0.6. (A hedged sketch of this graph-construction step follows the table.) |
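
The Pseudocode and Experiment Setup rows describe the pyramid-construction step: propositions extracted from the five reference summaries become graph nodes, an edge connects two nodes when their similarity reaches the threshold t = 0.5, and nodes with degree at least d_min = 3 are treated as salient anchors of SCUs. Below is a minimal Python sketch of that step under stated assumptions: `sim` is any sentence-similarity function returning values in [0, 1] (a stand-in for ADW), `Proposition` is a hypothetical container rather than a type from the PEAK code release, and the cluster and weight bookkeeping is a simplification of the paper's Algorithm 1, not its published implementation.

```python
from collections import namedtuple
from itertools import combinations

# A proposition extracted from one reference summary; `summary_id` records which
# reference summary it came from. (Hypothetical container, not from the PEAK release.)
Proposition = namedtuple("Proposition", ["text", "summary_id"])

T_EDGE = 0.5  # threshold t: add edge (u, v) iff sim(u, v) >= t
D_MIN = 3     # minimum degree for a node to count as salient


def build_similarity_graph(props, sim):
    """Return adjacency sets over proposition indices, one edge per similar pair."""
    adj = {i: set() for i in range(len(props))}
    for i, j in combinations(range(len(props)), 2):
        if sim(props[i].text, props[j].text) >= T_EDGE:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def candidate_scus(props, adj):
    """Treat each salient (high-degree) node as the anchor of a candidate SCU.

    The SCU weight here is the number of distinct reference summaries contributing
    a proposition to the cluster (anchor plus neighbors), mirroring the Pyramid
    idea that weight = number of model summaries expressing the content unit.
    """
    scus = []
    for i, neighbors in adj.items():
        if len(neighbors) >= D_MIN:
            members = {i} | neighbors
            weight = len({props[m].summary_id for m in members})
            scus.append({"label": props[i].text, "members": members, "weight": weight})
    return scus
```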
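For the scoring side (the paper's Algorithm 2), propositions extracted from a target summary are matched against the pyramid's SCUs, and the raw (non-normalized) score accumulates the weights of the SCUs the summary expresses. The sketch below reuses `sim` and the SCU dictionaries from the previous block; the matching threshold is left as a parameter here, whereas the paper fixes its thresholds (T1 = 0.8, T = 0.6) as quoted in the Experiment Setup row.

```python
def score_target_summary(target_texts, scus, sim, match_threshold=0.6):
    """Raw score of one target summary against the automatically built pyramid.

    `target_texts` are proposition strings extracted from the target summary.
    An SCU counts as expressed if any target proposition is at least
    `match_threshold`-similar to the SCU's label.
    """
    score = 0
    for scu in scus:
        if any(sim(scu["label"], p) >= match_threshold for p in target_texts):
            score += scu["weight"]
    return score
```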
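The Results column reports correlations between raw PEAK scores and scores derived from the two manually built pyramids P1 and P2 (the paper's Table 1). A minimal sketch of that comparison with placeholder numbers, not the paper's data, using SciPy's correlation functions:

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder score lists; the paper's Table 1 reports the actual correlations.
peak_scores   = [12, 9, 15, 7, 11, 10]  # hypothetical raw PEAK scores
manual_scores = [14, 8, 16, 6, 12, 9]   # hypothetical scores from manual pyramid P1

print("Pearson r:    %.3f" % pearsonr(peak_scores, manual_scores)[0])
print("Spearman rho: %.3f" % spearmanr(peak_scores, manual_scores)[0])
```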