Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CofCA: A STEP-WISE Counterfactual Multi-hop QA benchmark

Authors: Jian Wu, Linyi Yang, Zhen Wang, Manabu Okumura, Yue Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experimental results reveal a significant performance gap for several LLMs between Wikipedia-based factual data and counterfactual data, indicating data contamination issues in existing benchmarks. Moreover, we observe that LLMs usually bypass the correct reasoning chain, showing inflated multi-step reasoning performance.
Researcher Affiliation Academia Jian Wu (1), Linyi Yang (2), Zhen Wang (1), Manabu Okumura (1), Yue Zhang (3); (1) Institute of Science Tokyo, (2) University College London, (3) School of Engineering, Westlake University
Pseudocode No The paper describes a framework and methodology but does not include any explicitly labeled pseudocode or algorithm blocks. The steps are described in prose and through diagrams.
Open Source Code No The whole CofCA data are available at https://anonymous.4open.science/r/LLM_inherent_multi_step_eval-3818/. This link provides access to the dataset but does not explicitly state that it contains the source code for the methodology described in the paper.
Open Datasets Yes The whole CofCA data are available at https://anonymous.4open.science/r/LLM_inherent_multi_step_eval-3818/. In addition, we randomly selected 900 factual MHQA examples as the control group (300 from HotpotQA, 300 from 2WikiMultihopQA, and 300 from MuSiQue).
Dataset Splits Yes Following the settings of previous LLM evaluation benchmarks (Wang et al., 2023a; 2024), we treat the total of 1,800 examples as the test set. We evaluate LLMs on the CofCA benchmark, comprising 900 randomly selected examples from Wikipedia-based MHQA datasets (300 QA pairs from HotpotQA (Yang et al., 2018), 300 from 2WikiMultihopQA (Ho et al., 2020), and 300 from MuSiQue (Trivedi et al., 2021)), and 900 annotated counterfactual MHQA examples (divided into 2-hop, 3-hop, and 4-hop subsets).
Hardware Specification Yes For open-source models, all experiments are conducted on 8 A100 GPUs.
Software Dependencies No The paper names the LLMs used for experiments (e.g., GPT-4, GPT-3.5, Llama 2-7b, Mistral-7b, Qwen-7b) and GPT-4-turbo as an evaluator. However, it does not specify version numbers for core software libraries or programming languages (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup Yes To enhance reproducibility, we set the temperature to 0 for proprietary models, and all reported results are the average of three runs. We adopt the proprietary LLMs GPT-4 (Achiam et al., 2023), GPT-3.5 (Ouyang et al., 2022), text-davinci-003, Bing Chat, and GEMINI-pro (Team et al., 2023), and open-source LLMs such as Llama 2-7b, Mistral-7b, and Qwen-7b as the baselines. To decouple the LLMs' internal memory from their reasoning ability, and to let the LLMs retrieve answers from the given passage as much as possible, we design a prompt that requires LLMs to retrieve answers based only on the given context. The QA prompt is shown in Appendix B.
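The experiment setup quoted above can be sketched in code; this is a minimal illustration, not the authors' implementation. The prompt wording, the `ask_llm` stub, and the exact-match metric are all assumptions standing in for whatever client and scoring the paper actually used.

```python
# Illustrative sketch of the evaluation protocol: a context-only QA
# prompt, temperature fixed at 0, and scores averaged over three runs.

# Hypothetical prompt template; the paper's actual prompt is in its Appendix B.
CONTEXT_ONLY_PROMPT = (
    "Answer the question using only the given context; "
    "do not rely on outside knowledge.\n\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)

def ask_llm(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder for a real model call (GPT-4, Llama 2-7b, ...) at temperature 0."""
    raise NotImplementedError

def exact_match(prediction: str, gold: str) -> float:
    # Simple exact-match score; the paper's exact metric may differ.
    return float(prediction.strip().lower() == gold.strip().lower())

def averaged_over_runs(scores: list[float]) -> float:
    # Reported numbers are the mean of three independent runs.
    return sum(scores) / len(scores)
```

For one QA pair, the prompt would be built with `CONTEXT_ONLY_PROMPT.format(context=..., question=...)`, scored per run, and the per-run scores averaged with `averaged_over_runs`.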