Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Document Summarization with Conformal Importance Guarantees

Authors: Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Yapeter, Ilya Stanevich, Felipe Perez, Jesse C. Cresswell

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our experiments are designed to validate the conformal guarantee given by Theorem 1, and compare Conformal Importance Summarization to existing summarization methods that do not provide guarantees. We run ablations to understand (i) the influence of both α and β; (ii) the design of the importance score function R(c; x); and, (iii) for datasets without explicit ground-truth labels, the label creation method. Finally, we compare pure abstractive summarization with an LLM to our hybrid extractive-abstractive approach.
Researcher Affiliation	Industry	Bruce Kuwahara Signal 1 AI Toronto, Canada Chen-Yuan Lin Signal 1 AI Toronto, Canada Xiao Shi Huang Signal 1 AI Toronto, Canada Kin Kwan Leung Layer 6 AI Toronto, Canada Jullian Arta Yapeter Signal 1 AI Toronto, Canada Ilya Stanevich Signal 1 AI Toronto, Canada Felipe Perez Signal 1 AI Toronto, Canada Jesse C. Cresswell Layer 6 AI Toronto, Canada
Pseudocode	Yes	Algorithm 1: Greedy Optimization for Extractive-Summarization Labeling. Input: sentences x = [c_1, ..., c_p], reference summary r, scoring function V(·; ·), threshold δ. Output: extractive summary y. Compute v_i = V(c_i; r) for all i; sort indices by descending v, giving the permutation π = [π_1, ..., π_p]; y_curr ← ∅; V_curr ← 0; for j = 1 to p do: i ← π_j; Δ ← V(y_curr ∪ {c_i}; r) − V_curr; if Δ > δ then: y_curr ← y_curr ∪ {c_i}; V_curr ← V_curr + Δ; return y_curr.
Open Source Code	Yes	Code is available at github.com/layer6ai-labs/conformal-importance-summarization. Our codebase is available at github.com/layer6ai-labs/conformal-importance-summarization, including documentation.
Open Datasets	Yes	We use 5 datasets to evaluate the performance of our framework: ECTSum [46] contains complete transcripts from corporate earnings calls, as well as expert-curated extractive summaries at the sentence level; CSDS [40] is a dataset of Chinese-language customer-client conversations. Although the summaries are abstractive, each conversation has sentence-level labels for use as an extractive benchmark; CNN/DM [32] covers news sourced from CNN and The Daily Mail with human-written summary sentences; SciTLDR [11] consists of summaries of scientific papers sourced from both authors and peer-reviewers, and we use two versions where the input is either the full text (TLDR-Full), or just the abstract, introduction, and conclusion (TLDR-AIC); MTS-Dialog [7] is a collection of doctor-patient conversations and corresponding summaries intended to cover dialogue material. Datasets are publicly available.
Dataset Splits	Yes	For each dataset, a random subset (n = 100) of all datapoints is sampled to form the calibration dataset. All remaining samples form the test dataset, except for CSDS where we use only the original validation and test sets, and CNN/Daily Mail where we use 900 samples to reduce resource requirements, as shown in Table 1.
Hardware Specification	Yes	The first three used public APIs, while the latter two were hosted locally on a 48 GB A6000 GPU. Locally hosted open-source models run on an A6000 GPU with 48 GB of memory.
Software Dependencies	No	The paper mentions specific LLMs (GPT-4o mini, Gemini 2.0 Flash-Lite, Gemini 2.5 Flash, Llama3-8B, Qwen3-8B) and embedding models (SBERT [54]), but does not provide version numbers for general software dependencies or programming languages (e.g., Python, PyTorch).
Experiment Setup	Yes	Our experiments are designed to validate the conformal guarantee given by Theorem 1, and compare Conformal Importance Summarization to existing summarization methods that do not provide guarantees. We run ablations to understand (i) the influence of both α and β; (ii) the design of the importance score function R(c; x); and, (iii) for datasets without explicit ground-truth labels, the label creation method. Finally, we compare pure abstractive summarization with an LLM to our hybrid extractive-abstractive approach. The prompts used are given in Appendix B. For each dataset, a random subset (n = 100) of all datapoints is sampled to form the calibration dataset. All remaining samples form the test dataset, except for CSDS where we use only the original validation and test sets, and CNN/Daily Mail where we use 900 samples to reduce resource requirements, as shown in Table 1.
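
The greedy labeling procedure quoted in the Pseudocode row (Algorithm 1) can be sketched in Python as follows. This is an illustrative reading of the extracted pseudocode, not the authors' released implementation. The function names, the list-based representation of a summary, and the toy scoring function `unigram_recall` (a stand-in for V) are assumptions made for the example.

```python
from typing import Callable, List, Sequence


def greedy_extractive_labels(
    sentences: Sequence[str],
    reference: str,
    score: Callable[[List[str], str], float],
    delta: float,
) -> List[str]:
    """Greedily pick sentences whose marginal score gain exceeds delta."""
    # Score each sentence on its own against the reference summary.
    individual = [score([c], reference) for c in sentences]
    # Visit sentences in descending order of their individual score.
    order = sorted(range(len(sentences)), key=lambda i: individual[i], reverse=True)

    selected: List[str] = []
    current = 0.0
    for i in order:
        # Marginal gain from adding sentence i to the current selection.
        gain = score(selected + [sentences[i]], reference) - current
        if gain > delta:
            selected.append(sentences[i])
            current += gain
    return selected


def unigram_recall(summary: List[str], reference: str) -> float:
    """Hypothetical stand-in for V: share of reference words covered."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    covered = set(" ".join(summary).lower().split())
    return len(ref_words & covered) / len(ref_words)


if __name__ == "__main__":
    doc = ["profit fell", "sales rose sharply", "hello world"]
    ref = "sales rose sharply and profit fell"
    print(greedy_extractive_labels(doc, ref, unigram_recall, delta=0.1))
```

Note that the sketch returns sentences in selection order rather than document order. Whether the original procedure restores document order is not stated in the extracted text.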
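
The calibration/test split described in the Dataset Splits row (a random subset of n = 100 datapoints for calibration, with the remainder used for testing) might look like the sketch below. The function name, the `seed` parameter, and the use of Python's `random` module are assumptions for illustration. The excerpts above do not say how the authors seed or implement the sampling, and the dataset-specific exceptions for CSDS and CNN/Daily Mail are not modeled here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def calibration_test_split(
    data: Sequence[T], n_cal: int = 100, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly hold out n_cal items for calibration; the rest form the test set."""
    if n_cal > len(data):
        raise ValueError("n_cal cannot exceed the number of datapoints")
    rng = random.Random(seed)  # local generator; the seed value is an assumption
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cal_indices = indices[:n_cal]
    cal_set = set(cal_indices)
    calibration = [data[i] for i in cal_indices]
    # Keep the test items in their original order.
    test = [data[i] for i in range(len(data)) if i not in cal_set]
    return calibration, test


if __name__ == "__main__":
    calibration, test = calibration_test_split(list(range(1000)))
    print(len(calibration), len(test))
```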