Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CofCA: A STEP-WISE Counterfactual Multi-hop QA benchmark
Authors: Jian Wu, Linyi Yang, Zhen Wang, Manabu Okumura, Yue Zhang
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results reveal a significant performance gap of several LLMs between Wikipedia-based factual data and counterfactual data, deeming data contamination issues in existing benchmarks. Moreover, we observe that LLMs usually bypass the correct reasoning chain, showing an inflated multi-step reasoning performance. |
| Researcher Affiliation | Academia | Jian Wu (1), Linyi Yang (2), Zhen Wang (1), Manabu Okumura (1), Yue Zhang (3). Affiliations: (1) Institute of Science Tokyo; (2) University College London; (3) School of Engineering, Westlake University |
| Pseudocode | No | The paper describes a framework and methodology but does not include any explicitly labeled pseudocode or algorithm blocks. The steps are described in prose and through diagrams. |
| Open Source Code | No | The whole CofCA data are available at https://anonymous.4open.science/r/LLM_inherent_multi_step_eval-3818/. This link provides access to the dataset but does not explicitly state it contains the source code for the methodology described in the paper. |
| Open Datasets | Yes | The whole CofCA data are available at https://anonymous.4open.science/r/LLM_inherent_multi_step_eval-3818/. Besides, we also randomly selected 900 factual MHQA data as the control group (300 from HotpotQA, 300 from 2WikiMultihopQA, and 300 from MuSiQue). |
| Dataset Splits | Yes | Following the settings of previous LLM evaluation benchmarks (Wang et al., 2023a; 2024), we treat the total of 1800 data as the test set. We evaluate LLMs on the CofCA benchmark, including 900 randomly selected data from Wikipedia-based MHQA datasets (300 QA pairs from HotpotQA (Yang et al., 2018), 300 QA pairs from 2WikiMultihopQA (Ho et al., 2020), 300 QA pairs from MuSiQue (Trivedi et al., 2021)), and 900 annotated counterfactual MHQA data (divided into 2-hop, 3-hop, and 4-hop subsets). |
| Hardware Specification | Yes | For open-source models, all experiments are conducted on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions several LLMs used for experiments (e.g., GPT-4, GPT-3.5, Llama 2-7b, Mistral-7b, Qwen-7b) and GPT-4-turbo as an evaluator. However, it does not specify version numbers for other core software libraries or programming languages (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | To enhance reproducibility, we set the temperature to 0 for proprietary models, and all the experiment results are the average scores of three experiment runs. We adopt the proprietary LLMs GPT-4 (Achiam et al., 2023), GPT-3.5 (Ouyang et al., 2022), text-davinci-003, Bing Chat, and GEMINI-pro (Team et al., 2023), and open-source LLMs such as Llama 2-7b, Mistral-7b, and Qwen-7b as the baselines. To decouple LLMs' internal memory and reasoning ability, and let LLMs retrieve answers from the given passage as much as possible, we design a prompt that requires LLMs to only retrieve answers based on the given context. The QA prompt is shown in Appendix B. |
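The averaging protocol quoted in the Experiment Setup row (temperature 0, mean of three runs) can be sketched as follows. This is a minimal illustration, not the authors' code: `query_model` is a hypothetical callable standing in for an LLM API call, and the stub score used in the usage example is invented.

```python
from statistics import mean

def run_experiment(query_model, prompt: str, n_runs: int = 3) -> float:
    """Report the mean score over n_runs independent runs, mirroring the
    paper's stated protocol of averaging three experiment results.

    `query_model` is a placeholder callable that is expected to issue
    `prompt` with temperature=0 (for determinism on proprietary models)
    and return a scalar score for that run.
    """
    scores = [query_model(prompt, temperature=0) for _ in range(n_runs)]
    return mean(scores)

# Usage with a stand-in model; a real run would call an LLM API instead.
stub = lambda prompt, temperature: 0.72
avg_score = run_experiment(stub, "2-hop counterfactual QA prompt")
```

With temperature fixed at 0, repeated runs mainly absorb residual nondeterminism in proprietary endpoints, which is why the mean of three runs is reported rather than a single score.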