Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Conformal Information Pursuit for Interactively Guiding Large Language Models

Authors: Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, Rene Vidal

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with 20 Questions show that C-IP obtains better predictive performance and shorter query-answer chains compared to previous approaches to IP and uncertainty-based chain-of-thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C-IP achieves competitive performance with direct single-turn prediction while offering greater interpretability.
Researcher Affiliation | Academia | Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal; University of Pennsylvania
Pseudocode | Yes | Appendix I is the pseudocode for the different algorithms including IP, C-IP and DP.
Open Source Code | Yes | Our code is available at https://www.github.com/ryanchankh/ConformalInformationPursuit/.
Open Datasets | Yes | We obtain 20 common animal names from the Animals with Attributes 2 (AwA2) [136] dataset. ... Finally, we apply our method to the setting of interactive medical question answering on the MediQ [70] dataset.
Dataset Splits | Yes | To ensure a fair evaluation, we divide each category into three equal sets: D_est for estimating entropy, D_cal for calibration, and D_test for test-set evaluation. We perform three-fold cross validation and evaluate the average performance and its standard deviation.
Hardware Specification | Yes | The experiments are conducted on a workstation of 8 NVIDIA A5000 GPUs.
Software Dependencies | Yes | All experiments are implemented in Python 3.12. The main packages used are huggingface, PyTorch, Numpy, and Together AI API (for UoT baseline).
Experiment Setup | Yes | For LLM hyperparameters during generation, prompts, and further implementation details, refer to Appendix G. ... Unless stated otherwise, we use the default hyperparameters from huggingface. We use the following LLM hyperparameters every time we inference: do_sample=True, temperature=0.7, and max_new_tokens=1024.
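The dataset-split entry above describes three equal sets (for entropy estimation, calibration, and testing) evaluated with three-fold cross validation. The following is a minimal sketch of one way such a scheme could be set up, not the authors' implementation. It assumes the three parts rotate roles across folds and that the reported spread is the sample standard deviation; neither detail is stated in the quoted text. The names `three_way_folds` and `summarize` are hypothetical.

```python
import random
import statistics


def three_way_folds(items, seed=0):
    """Split items into three equal parts and rotate their roles over three folds.

    Assumption: each part serves once as the estimation set, once as the
    calibration set, and once as the test set.
    """
    items = list(items)
    if len(items) % 3 != 0:
        raise ValueError("number of items must be divisible by 3")
    random.Random(seed).shuffle(items)  # seeded shuffle for repeatability
    k = len(items) // 3
    parts = [items[:k], items[k:2 * k], items[2 * k:]]
    folds = []
    for shift in range(3):
        folds.append({
            "est": parts[shift],             # used to estimate entropy
            "cal": parts[(shift + 1) % 3],   # used for calibration
            "test": parts[(shift + 2) % 3],  # held out for evaluation
        })
    return folds


def summarize(scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

With this rotation, every item appears in the test set of exactly one fold, so the per-fold scores passed to `summarize` together cover the full category.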
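The generation settings quoted in the experiment-setup entry can be collected in one place so that every call uses the same values. This is a small sketch under that assumption; the helper name `generation_kwargs` is hypothetical, and only the three values quoted above are taken from the source. All other settings are left at library defaults, as the quoted text indicates.

```python
# Values quoted in the experiment-setup entry; everything else stays at defaults.
DEFAULT_GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 1024,
}


def generation_kwargs(**overrides):
    """Return a copy of the default settings with optional per-call overrides."""
    merged = dict(DEFAULT_GENERATION_KWARGS)  # copy so the defaults stay untouched
    merged.update(overrides)
    return merged
```

With a Hugging Face `transformers` model these would typically be passed as keyword arguments, for example `model.generate(**inputs, **generation_kwargs())`, where `model` and `inputs` are assumed to be a loaded model and its tokenized prompt.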