Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Precise Information Control in Long-Form Text Generation

Authors: Jacqueline He, Howard Yen, Margaret Li, Stella Li, Zhiyuan Zeng, Weijia Shi, Yulia Tsvetkov, Danqi Chen, Pang Wei W Koh, Luke Zettlemoyer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, without adding any unsupported ones. PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still hallucinate against user-provided input in over 70% of generations. To alleviate this lack of faithfulness, we introduce a post-training framework that uses a weakly supervised preference data construction method to train an 8B PIC-LM with stronger PIC ability, improving from 69.1% to 91.0% F1 in the full PIC setting.
Researcher Affiliation	Academia	Paul G. Allen School of Computing Science & Engineering, University of Washington; Princeton Language and Intelligence (PLI), Princeton University; Allen Institute for AI
Pseudocode	Yes	Algorithm 1 sketches out the protocol used for preference data creation. Note that for simplicity, the algorithm assumes a set τ and pmax for the entire dataset; in practice, we assigned different values of τ and pmax based on the PIC setting.
Open Source Code	Yes	We release all artifacts (PIC-Bench, PIC-LM 8B models, and related data processing, training, and evaluation scripts) for reproducibility and future development: Codebase jacqueline-he/precise-information-control
Open Datasets	Yes	We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. PIC-Bench Data jacquelinehe/pic-bench
Dataset Splits	Yes	We design PIC-Bench, a benchmark of six full and two partial PIC tasks. We choose eight long-form generation datasets from prior work [81, 77, 32, 105, 42] and re-frame their evaluation to our problem formulation. Table 1 shows PIC-Bench task information (additional details in Appendix B.1). For PIC-Bench evaluation, the biography and QA datasets are in-distribution (ID), whereas the other datasets are out-of-domain (OOD), allowing us to assess performance in both settings. We ensure that no train-test overlap exists.
Hardware Specification	Yes	We conduct training with full parameters using the accelerate package [36] and DeepSpeed ZeRO-3 Offload [96] on a cluster of 40GB NVIDIA L40 and A40 GPUs.
Software Dependencies	No	The paper mentions specific software packages, namely the "accelerate package [36]", "DeepSpeed ZeRO-3 Offload [96]", and the "vLLM toolkit [51]", but does not give version numbers for any of them. It also mentions LLMs such as "GPT-4o mini" and "Claude 3.5 Sonnet", but these are models/services rather than installable software with version numbers in the sense of a dependency list.
Experiment Setup	Yes	For SFT, we use a batch size of 128 and a learning rate of 1e-5, and train for 2 epochs. For DPO, we use a learning rate of 1e-6, a batch size of 128, a weight decay of 0.1, a β of 5, and train for 1 epoch.
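
The reported training settings can be collected into a small machine-readable manifest. The following is a minimal sketch and is not taken from the authors' codebase: the names StageConfig, installed_version, and build_manifest are hypothetical, the learning rates are read as 1e-5 (SFT) and 1e-6 (DPO), and the package names accelerate, deepspeed, and vllm are assumed to be the PyPI distributions corresponding to the tools named in the table. Because the paper does not report versions, the sketch records whatever is installed locally instead of guessing.

```python
import json
from dataclasses import asdict, dataclass
from importlib import metadata
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one post-training stage, as listed in the table."""

    learning_rate: float
    batch_size: int
    epochs: int
    weight_decay: Optional[float] = None  # reported for DPO only
    beta: Optional[float] = None  # DPO beta; not applicable to SFT


# Values transcribed from the "Experiment Setup" row.
SFT = StageConfig(learning_rate=1e-5, batch_size=128, epochs=2)
DPO = StageConfig(
    learning_rate=1e-6, batch_size=128, epochs=1, weight_decay=0.1, beta=5.0
)


def installed_version(package: str) -> str:
    """Return the locally installed version, or a marker if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_manifest() -> dict:
    """Combine the reported hyperparameters with local package versions."""
    # Assumed PyPI names; the paper names the tools but not their versions.
    packages = ("accelerate", "deepspeed", "vllm")
    return {
        "sft": asdict(SFT),
        "dpo": asdict(DPO),
        "versions": {name: installed_version(name) for name in packages},
    }


if __name__ == "__main__":
    print(json.dumps(build_manifest(), indent=2))
```

Storing such a manifest alongside a rerun would cover the gap flagged in the Software Dependencies row, since the versions actually used would be captured at run time.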