Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Conformal Information Pursuit for Interactively Guiding Large Language Models
Authors: Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, Rene Vidal
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with 20 Questions show that C-IP obtains better predictive performance and shorter query-answer chains compared to previous approaches to IP and uncertainty-based chain-of-thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C-IP achieves competitive performance with direct single-turn prediction while offering greater interpretability. |
| Researcher Affiliation | Academia | Kwan Ho Ryan Chan, Yuyan Ge, Edgar Dobriban, Hamed Hassani, René Vidal; University of Pennsylvania |
| Pseudocode | Yes | Appendix I is the pseudocode for the different algorithms including IP, C-IP and DP. |
| Open Source Code | Yes | Our code is available at https://www.github.com/ryanchankh/ConformalInformationPursuit/. |
| Open Datasets | Yes | We obtain 20 common animal names from the Animals with Attributes 2 (AwA2) [136] dataset. ... Finally, we apply our method to the setting of interactive medical question answering on the MediQ [70] dataset. |
| Dataset Splits | Yes | To ensure a fair evaluation, we divide each category into three equal sets: D_est for estimating entropy, D_cal for calibration, and D_test for test-set evaluation. We perform three-fold cross validation and evaluate the average performance and its standard deviation. |
| Hardware Specification | Yes | The experiments are conducted on a workstation of 8 NVIDIA A5000 GPUs. |
| Software Dependencies | Yes | All experiments are implemented in Python 3.12. The main packages used are huggingface, PyTorch, Numpy, and Together AI API (for UoT baseline). |
| Experiment Setup | Yes | For LLM hyperparameters during generation, prompts, and further implementation details, refer to Appendix G. ... Unless stated otherwise, we use the default hyperparameters from huggingface. We use the following LLM hyperparameters every time we inference: do_sample=True, temperature=0.7, and max_new_tokens=1024. |
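
The Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It is not the authors' code: the helper `three_way_folds`, the seed, and the rotation of roles across folds are assumptions about how "three equal sets" and "three-fold cross validation" might be combined. Only the three generation values are taken from the quoted text.

```python
import random

# Generation settings quoted in the Experiment Setup row.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 1024,
}


def three_way_folds(items, seed=0):
    """Yield (est, cal, test) splits, rotating the three roles across folds.

    Hypothetical helper: the rotation scheme is an assumption, not a
    detail confirmed by the source.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled) // 3  # leftover items are dropped so the parts stay equal
    parts = [shuffled[i * n:(i + 1) * n] for i in range(3)]
    for k in range(3):
        # Fold k: part k estimates entropy, the next calibrates, the last tests.
        yield parts[k], parts[(k + 1) % 3], parts[(k + 2) % 3]


folds = list(three_way_folds(range(30)))
```

With Hugging Face `transformers`, a dictionary like `GENERATION_KWARGS` would typically be unpacked into a `generate` call, for example `model.generate(**inputs, **GENERATION_KWARGS)`, where `model` and `inputs` are placeholders for a loaded model and tokenized prompt.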