Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Conformal Information Pursuit for Interactively Guiding Large Language Models

Authors: Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments with 20 Questions show that C-IP obtains better predictive performance and shorter query-answer chains compared to previous approaches to IP and uncertainty-based chain-of-thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C-IP achieves competitive performance with direct single-turn prediction while offering greater interpretability.
Researcher Affiliation	Academia	Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal; University of Pennsylvania
Pseudocode	Yes	Appendix I is the pseudocode for the different algorithms including IP, C-IP and DP.
Open Source Code	Yes	Our code is available at https://www.github.com/ryanchankh/ConformalInformationPursuit/.
Open Datasets	Yes	We obtain 20 common animal names from the Animals with Attributes 2 (AwA2) [136] dataset. ... Finally, we apply our method to the setting of interactive medical question answering on the MediQ [70] dataset.
Dataset Splits	Yes	To ensure a fair evaluation, we divide each category into three equal sets: D_est for estimating entropy, D_cal for calibration, and D_test for test-set evaluation. We perform three-fold cross validation and evaluate the average performance and its standard deviation.
Hardware Specification	Yes	The experiments are conducted on a workstation of 8 NVIDIA A5000 GPUs.
Software Dependencies	Yes	All experiments are implemented in Python 3.12. The main packages used are huggingface, PyTorch, Numpy, and Together AI API (for UoT baseline).
Experiment Setup	Yes	For LLM hyperparameters during generation, prompts, and further implementation details, refer to Appendix G. ... Unless stated otherwise, we use the default hyperparameters from huggingface. We use the following LLM hyperparameters every time we inference: do_sample=True, temperature=0.7, and max_new_tokens=1024.
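
The Dataset Splits and Experiment Setup rows can be made concrete with a minimal sketch. It is not the authors' code: the helper names `three_way_folds` and `mean_and_std`, the shuffling seed, and the way the three parts rotate through the estimation, calibration, and test roles are all assumptions made for illustration. Only the three generation settings are taken from the quoted text.

```python
import random

# Generation settings quoted in the Experiment Setup row. In Hugging Face
# transformers these names are keyword arguments to `model.generate(...)`.
GENERATION_KWARGS = {"do_sample": True, "temperature": 0.7, "max_new_tokens": 1024}


def three_way_folds(items, seed=0):
    """Yield (est, cal, test) splits, rotating the role of three equal parts.

    Hypothetical helper: the rotation scheme is an assumption about how the
    three-fold cross validation is arranged, not the authors' implementation.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    size = len(items) // 3  # leftover items are dropped so the parts stay equal
    parts = [items[i * size:(i + 1) * size] for i in range(3)]
    for k in range(3):
        # Each part serves once as the estimation, calibration, and test set.
        yield parts[k], parts[(k + 1) % 3], parts[(k + 2) % 3]


def mean_and_std(values):
    """Return the mean and population standard deviation of per-fold scores."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5


if __name__ == "__main__":
    for est, cal, test in three_way_folds(range(30)):
        print(len(est), len(cal), len(test))
```

In a real run, `three_way_folds` would be applied to each category separately, and the per-fold metric would be passed to `mean_and_std` to report the average and its standard deviation.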