Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CofCA: A STEP-WISE Counterfactual Multi-hop QA benchmark

Authors: Jian Wu, Linyi Yang, Zhen Wang, Manabu Okumura, Yue Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results reveal a significant performance gap for several LLMs between Wikipedia-based factual data and counterfactual data, indicating data contamination issues in existing benchmarks. Moreover, we observe that LLMs usually bypass the correct reasoning chain, showing inflated multi-step reasoning performance.
Researcher Affiliation Academia Jian Wu (1), Linyi Yang (2), Zhen Wang (1), Manabu Okumura (1), Yue Zhang (3); (1) Institute of Science Tokyo, (2) University College London, (3) School of Engineering, Westlake University
Pseudocode No The paper describes a framework and methodology but does not include any explicitly labeled pseudocode or algorithm blocks. The steps are described in prose and through diagrams.
Open Source Code No The whole CofCA data are available at https://anonymous.4open.science/r/LLM_inherent_multi_step_eval-3818/. This link provides access to the dataset but does not explicitly state that it contains the source code for the methodology described in the paper.
Open Datasets Yes The whole CofCA data are available at https://anonymous.4open.science/r/LLM_inherent_multi_step_eval-3818/. In addition, we randomly selected 900 factual MHQA examples as the control group (300 from HotpotQA, 300 from 2WikiMultihopQA, and 300 from MuSiQue).
Dataset Splits Yes Following the settings of previous LLM evaluation benchmarks (Wang et al., 2023a; 2024), we treat the total of 1,800 examples as the test set. We evaluate LLMs on the CofCA benchmark, comprising 900 randomly selected examples from Wikipedia-based MHQA datasets (300 QA pairs from HotpotQA (Yang et al., 2018), 300 from 2WikiMultihopQA (Ho et al., 2020), and 300 from MuSiQue (Trivedi et al., 2021)), and 900 annotated counterfactual MHQA examples (divided into 2-hop, 3-hop, and 4-hop subsets).
Hardware Specification Yes For open-source models, all experiments are conducted on 8 A100 GPUs.
Software Dependencies No The paper names the LLMs used for experiments (e.g., GPT-4, GPT-3.5, Llama 2-7b, Mistral-7b, Qwen-7b) and GPT-4-turbo as an evaluator. However, it does not specify version numbers for core software libraries or programming languages (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup Yes To enhance reproducibility, we set the temperature to 0 for proprietary models, and all reported results are the average of three runs. We adopt the proprietary LLMs GPT-4 (Achiam et al., 2023), GPT-3.5 (Ouyang et al., 2022), text-davinci-003, Bing Chat, and GEMINI-pro (Team et al., 2023), and open-source LLMs such as Llama 2-7b, Mistral-7b, and Qwen-7b as the baselines. To decouple the LLMs' internal memory from their reasoning ability, and to let the LLMs retrieve answers from the given passage as much as possible, we design a prompt that requires LLMs to retrieve answers based only on the given context. The QA prompt is shown in Appendix B.
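The experiment setup quoted above can be sketched in code; this is a minimal illustration, not the authors' implementation. The prompt wording, the `ask_llm` stub, and the exact-match metric are all assumptions standing in for whatever client and scoring the paper actually used.

```python
# Illustrative sketch of the evaluation protocol: a context-only QA
# prompt, temperature fixed at 0, and scores averaged over three runs.

# Hypothetical prompt template; the paper's actual prompt is in its Appendix B.
CONTEXT_ONLY_PROMPT = (
    "Answer the question using only the given context; "
    "do not rely on outside knowledge.\n\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)

def ask_llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a real model call (GPT-4, Llama 2-7b, ...) at temperature 0."""
    raise NotImplementedError

def exact_match(prediction: str, gold: str) -> float:
    # Simple exact-match score; the paper's exact metric may differ.
    return float(prediction.strip().lower() == gold.strip().lower())

def averaged_over_runs(scores: list[float]) -> float:
    # Reported numbers are the mean of three independent runs.
    return sum(scores) / len(scores)
```

For one QA pair, the prompt would be built with `CONTEXT_ONLY_PROMPT.format(context=..., question=...)`, scored per run, and the per-run scores averaged with `averaged_over_runs`.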