Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Dependency Matters: Enhancing LLM Reasoning with Explicit Knowledge Grounding
Authors: Xiangyu Wen, Min Li, Junhua Huang, Jianyuan Zhong, Zhijian Xu, Zeju Li, Yongxiang Huang, Mingxuan Yuan, Qiang Xu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across diverse reasoning benchmarks including StrategyQA, CommonsenseQA, GPQA, and TruthfulQA demonstrate that GRiD substantially improves reasoning accuracy, consistency, and faithfulness compared to recent state-of-the-art structured reasoning methods. |
| Researcher Affiliation | Collaboration | (1) The Chinese University of Hong Kong; (2) Huawei Noah's Ark Lab; (3) Southeast University; (4) Huawei Hong Kong Research Center |
| Pseudocode | Yes | Algorithm 1: GRiD: Data Curation and Dependency Verification |
| Open Source Code | Yes | Code is available at: https://github.com/cure-lab/GRiD. |
| Open Datasets | Yes | We evaluate models across four distinct QA benchmarks, each targeting specific reasoning and knowledge retrieval skills. StrategyQA [12] tests implicit multi-step reasoning...; CommonsenseQA [34] assesses common-sense knowledge application...; GPQA [28] evaluates advanced STEM knowledge...; and TruthfulQA [24] measures a model's ability to avoid common misconceptions... |
| Dataset Splits | Yes | For GPQA, we use the GPQA-Diamond subset exclusively for testing, while the GPQA-Extended subset, which has excluded the samples from GPQA-Diamond, is used for training. For TruthfulQA, we randomly split the dataset into training and testing sets using an 8:2 ratio, with 160 samples selected for the test set. |
| Hardware Specification | Yes | conduct all experiments using NVIDIA L40 GPUs and Intel(R) Xeon(R) Gold 6426Y CPUs. |
| Software Dependencies | No | The paper mentions LoRA for fine-tuning and various LLMs (Llama, Qwen, GPTs, DeepSeek) but does not provide specific version numbers for underlying software libraries like PyTorch, TensorFlow, or Python. |
| Experiment Setup | Yes | Table 8: Settings for model fine-tuning. Batch size: 1; Epochs: 10/6/15/10; Warmup Ratio: 0.17; Learning Rate: 5e-5/1e-4; Gradient Accumulation Steps: 8; LoRA R: 16; LoRA Alpha: 16; LoRA Dropout: 0.01 |
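
The fine-tuning settings in the Experiment Setup row can be collected into a small configuration object, which makes two derived quantities explicit: the effective batch size and the LoRA scaling factor. This is a minimal sketch, not the authors' code. The class name `FineTuneSettings`, its field names, and the `num_devices` parameter are hypothetical. The row does not say which run each slash-separated value belongs to, so the epochs and learning rates are kept as unlabeled tuples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSettings:
    """Values transcribed from the Experiment Setup row (Table 8 of the paper).

    The class and field names are illustrative, not taken from the GRiD code.
    """
    batch_size: int = 1
    # Slash-separated in the source; the run each value applies to is not stated there.
    epochs: tuple = (10, 6, 15, 10)
    warmup_ratio: float = 0.17
    # Slash-separated in the source; same caveat as for epochs.
    learning_rates: tuple = (5e-5, 1e-4)
    gradient_accumulation_steps: int = 8
    lora_r: int = 16
    lora_alpha: int = 16
    lora_dropout: float = 0.01

    def effective_batch_size(self, num_devices: int = 1) -> int:
        # num_devices is an assumption: the row does not report GPUs per run.
        return self.batch_size * self.gradient_accumulation_steps * num_devices

    def lora_scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


settings = FineTuneSettings()
print(settings.effective_batch_size())  # 8
print(settings.lora_scaling())          # 1.0
```

With a per-step batch size of 1 and 8 gradient accumulation steps, each optimizer update sees 8 samples on a single device. Because LoRA Alpha equals LoRA R, the conventional scaling factor is 1.0, assuming the standard alpha / r rule is the one in use.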