Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Retrieval is Not Enough: Enhancing RAG through Test-Time Critique and Optimization

Authors: Jiaqi Wei, Hao Zhou, Xiang Zhang, Di Zhang, Zijie Qiu, Noah Wei, Jinzhe Li, Wanli Ouyang, Siqi Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations across seven benchmark datasets and three diverse model families firmly establish ALIGNRAG's state-of-the-art (SOTA) performance, consistently surpassing existing methods on a wide range of tasks. ... Section 4 Experiments
Researcher Affiliation	Academia	Jiaqi Wei (1,2), Hao Zhou (3), Xiang Zhang (4), Di Zhang (5), Zijie Qiu (5), Wei Wei (6), Jinzhe Li (2,5), Wanli Ouyang (2), Siqi Sun (2,5). Affiliations: 1 Zhejiang University; 2 Shanghai Artificial Intelligence Laboratory; 3 South China University of Technology; 4 University of British Columbia; 5 Fudan University; 6 University of Hong Kong
Pseudocode	Yes	A.5 Pseudo-code of Novel Algorithms for Critique-Aware Learning ... Algorithm 1 introduces Contrastive Critique Synthesis ... Algorithm 2, Critique Fine-Tuning (CFT) ... Algorithm 3 CRITIQUE PREFERENCE OPTIMIZATION (CPO) ... Algorithm 4 describes Critique-Driven Alignment (CDA)
Open Source Code	Yes	Our source code is provided at Github.
Open Datasets	Yes	Dataset. To train a strong critique generator, we construct a 10K dataset by sampling 2K instances from each of five benchmarks: PopQA [41], TriviaQA [42], Natural Questions [43], 2WikiMultihopQA [44], and ASQA [45]. Furthermore, we evaluate our method on the same five in-domain benchmarks, along with two out-of-distribution (OOD) tasks, i.e., HotpotQA [46] and SQuAD [47].
Dataset Splits	Yes	To train a strong critique generator, we construct a 10K dataset by sampling 2K instances from each of five benchmarks... Hierarchy-1: Not Relevant, Not Helpful, Not Complete. We sample 200 instances per benchmark... Hierarchy-2: Relevant, Not Helpful, Not Complete. We sample 400 instances per benchmark... Hierarchy-3 & 4: Relevant, Helpful, Not Complete / Complete. We sample 1,400 instances per benchmark where the context is both relevant and helpful...
Hardware Specification	Yes	We fine-tune our models using the LoRA method on 2 NVIDIA A100 GPUs, each with 80GB of memory.
Software Dependencies	No	The paper mentions software components like "LoRA method," "AdamW optimizer," and "bf16 (brain floating point) precision," but it does not specify any version numbers for these software libraries or tools. The programming language used (e.g., Python) is also not specified with a version.
Experiment Setup	Yes	Training Details. We fine-tune our models using the LoRA method on 2 NVIDIA A100 GPUs... over 2 epochs with a learning rate of 1e-5 using the AdamW optimizer and employs a per-device batch size of 16, leveraging gradient accumulation... We set the LoRA-specific hyperparameters as follows: lora_rank = 16 and lora_alpha = 64... The sequence cutoff length is 6144 tokens, with a warmup ratio of 0.1 applied... Additionally, we utilize bf16 (brain floating point) precision...
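
For readers who want to check the quoted numbers, the sketch below records the values reported in the Dataset Splits and Experiment Setup rows and verifies that they add up. It is an illustration under stated assumptions, not the authors' code: the names `BENCHMARKS`, `PER_BENCHMARK_SAMPLES`, `ReportedTrainingSetup`, and the helper functions are hypothetical, and details the quoted text elides (such as the number of gradient accumulation steps) are deliberately left out.

```python
from dataclasses import dataclass

# The five in-domain benchmarks named in the "Open Datasets" row.
BENCHMARKS = ["PopQA", "TriviaQA", "Natural Questions", "2WikiMultihopQA", "ASQA"]

# Per-benchmark sample counts per critique hierarchy, as quoted in "Dataset Splits".
# The dictionary keys are hypothetical labels chosen for this sketch.
PER_BENCHMARK_SAMPLES = {
    "hierarchy_1_not_relevant": 200,
    "hierarchy_2_relevant_not_helpful": 400,
    "hierarchy_3_and_4_relevant_helpful": 1400,
}


@dataclass(frozen=True)
class ReportedTrainingSetup:
    """Hyperparameters as quoted in the "Experiment Setup" row (field names are hypothetical)."""
    num_gpus: int = 2
    gpu_memory_gb: int = 80
    epochs: int = 2
    learning_rate: float = 1e-5
    per_device_batch_size: int = 16
    lora_rank: int = 16
    lora_alpha: int = 64
    cutoff_len: int = 6144
    warmup_ratio: float = 0.1
    precision: str = "bf16"
    # Gradient accumulation steps are not given in the quoted text, so they are omitted.


def per_benchmark_total() -> int:
    """Sum of the hierarchy counts sampled from one benchmark."""
    return sum(PER_BENCHMARK_SAMPLES.values())


def dataset_total() -> int:
    """Total training instances across all five benchmarks."""
    return per_benchmark_total() * len(BENCHMARKS)


def lora_scaling(setup: ReportedTrainingSetup) -> float:
    """Conventional LoRA scaling factor alpha / rank (assumes standard LoRA scaling)."""
    return setup.lora_alpha / setup.lora_rank


if __name__ == "__main__":
    setup = ReportedTrainingSetup()
    print("per benchmark:", per_benchmark_total())
    print("dataset total:", dataset_total())
    print("LoRA scaling:", lora_scaling(setup))
```

The hierarchy counts sum to the 2K instances per benchmark and the 10K total stated in the table, so the Dataset Splits and Open Datasets rows agree with each other.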