Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reasoning Elicitation in Language Models via Counterfactual Feedback

Authors: Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya Nori, Javier González

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities. (...) 5 EXPERIMENTS
Researcher Affiliation	Collaboration	Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, Javier González; Harvard University, Microsoft Research Cambridge, Cornell Tech
Pseudocode	Yes	Algorithm 1 Supervised Counterfactual Feedback (...) Algorithm 2 Preference-based Counterfactual Feedback (...) Algorithm 3 Preference-based Causal Consistency Feedback
Open Source Code	No	No explicit statement regarding the release of their own source code for the methodology is provided. The paper references third-party models like Phi-3 mini from Hugging Face but does not offer a link to their specific implementation.
Open Datasets	Yes	We present three real-world causal reasoning problems: in the Healthcare domain, we examine breast cancer treatment and develop a simplified problem that determines how different treatment options, namely radiotherapy/chemotherapy and surgery, are assigned to patients based on cancer type, tumor size, and nodal involvement. This model is grounded in a real-world guideline (MD Anderson Cancer Center) and published statistics on the disease (Orrantia Borunda et al., 2022; Sezgın et al., 2020; Carey et al., 2006). In the Engineering domain, we implement an automatic fault detection algorithm for transmission lines (Reddy et al., 2016). (...) In the Math Benchmarking domain, we select a math question from GSM8K (Cobbe et al., 2021), a widely used benchmark for evaluating language models on grade school math problems.
Dataset Splits	No	The paper describes generating datasets for fine-tuning and evaluation scenarios (in-domain, generalization modes) by sampling '100 contexts per causal relationship' and generating '10 answers for each question per context'. It mentions 'held-out set of test samples' and distinct training and evaluation phases. However, it does not provide specific numerical splits (e.g., percentages or exact counts for train/validation/test from a fixed dataset) in the traditional sense, but rather describes how different types of causal relationships are used for training versus testing.
Hardware Specification	No	The paper does not provide specific details about the hardware used for running its experiments, such as GPU models, CPU types, or memory.
Software Dependencies	No	The paper mentions using specific language models like 'Phi-3 mini' and 'Llama 3 8B' and refers to Hugging Face models. It does not list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) that would be needed for replication.
Experiment Setup	No	The paper describes the fine-tuning methods (SFT, DPO, DPO+CCF) and data generation strategies, but it does not provide specific hyperparameter values such as learning rate, batch size, number of epochs, or optimizer settings used for the fine-tuning process.