Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Reliable Decision‑Making via Calibration‑Oriented Retrieval‑Augmented Generation

Authors: Chaeyun Jang, Deukhwan Cho, Seanie Lee, Hyungi Lee, Juho Lee

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Then we empirically validate that CalibRAG improves calibration performance as well as accuracy, compared to other baselines across various datasets.
Researcher Affiliation | Academia | Chaeyun Jang (KAIST, EMAIL); Deukhwan Cho (KAIST, EMAIL); Seanie Lee (KAIST, EMAIL); Hyungi Lee (Kookmin University, EMAIL); Juho Lee (KAIST, EMAIL)
Pseudocode | No | The paper does not present formal theoretical results, such as theorems or proofs. While it includes mathematical formulations (1, 2, 3, 4) to define the calibration objective and forecasting function, these are used to describe the method rather than to prove theoretical guarantees.
Open Source Code | Yes | Code and data are included in the supplementary materials, with full instructions to reproduce the main results.
Open Datasets | Yes | To conduct the supervised learning discussed in Sec. 3.2, it is essential to construct an appropriate synthetic training dataset S consisting of the triples (t, q, d, b). We first extract the (x, y) decision-making task pairs from the following three Question Answering datasets: 1) TriviaQA [37], 2) SQuAD2.0 [38], and 3) WikiQA [39] datasets. ... Evaluation Datasets For zero-shot evaluation, we employ several datasets covering diverse domains and question types. HotpotQA [51] is a multi-hop question-answering dataset... WebQA [47] is an open-domain question-answering dataset... Natural Questions (NQ) [46] is another large-scale question-answering dataset... We also evaluate domain-specific datasets, including BioASQ [57], a biomedical QA dataset... as well as Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE) [49]
Dataset Splits | Yes | For all experiments, following Sec. 3.3, we collect a total of 20,870 samples for training and 4,125 for validation. All evaluations are conducted in a zero-shot setting on held-out tasks that are disjoint from both the training and validation sets.
Hardware Specification | Yes | Our experiments are conducted on NVIDIA RTX 3090 and RTX A6000 GPUs.
Software Dependencies | Yes | Our implementation builds on key libraries such as PyTorch 2.1.2 [54], Hugging Face Transformers 4.45.1 [55], and PEFT 0.7.1
Experiment Setup | Yes | Table 2 outlines the hyperparameters used for training the base model and LoRA, including key parameters such as learning rate, batch size, and LoRA-specific settings like rank and alpha.
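
The versions quoted in the Software Dependencies row can be checked against a local environment before attempting a reproduction. The following is a minimal sketch, not part of the paper's released code: the PyPI distribution names (`torch`, `transformers`, `peft`) and the helper names `base_version` and `check_versions` are assumptions introduced here for illustration; only the version numbers come from the report.

```python
from importlib import metadata

# Versions quoted in the Software Dependencies row. The PyPI distribution
# names used as keys are assumptions; the report only names the libraries.
PINNED = {"torch": "2.1.2", "transformers": "4.45.1", "peft": "0.7.1"}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu121' from a version string."""
    return version.split("+", 1)[0]


def check_versions(pinned=PINNED, lookup=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = base_version(lookup(name))
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    # Print one line per deviation; no output means the environment matches.
    for name, (expected, found) in check_versions().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means the installed packages match the quoted versions. The `lookup` parameter exists only so the check can be exercised without the packages installed.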