Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Boosting Resilience of Large Language Models through Causality-Driven Robust Optimization

Authors: Xiaoling Zhou, Mingjie Zhang, Zhemg Lee, Yuncheng Hua, Chengli Xing, Wei Ye, Flora D. Salim, Shikun Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across various tasks using twelve different LLMs demonstrate the superior performance of our framework, underscoring its significant effectiveness in reducing the model's dependence on spurious associations and mitigating hallucinations. Extensive experiments have been conducted on both natural language understanding (NLU) and natural language generation (NLG) tasks, leveraging twelve different LLMs with varying parameter sizes.
Researcher Affiliation	Academia	Xiaoling Zhou Peking University EMAIL Mingjie Zhang Peking University EMAIL Zhemg Lee Tianjin University EMAIL Yuncheng Hua University of New South Wales EMAIL Chengli Xing Peking University EMAIL Wei Ye Peking University EMAIL Flora D. Salim University of New South Wales EMAIL Shikun Zhang Peking University EMAIL
Pseudocode	No	The paper describes the methodology in prose, detailing steps like data collection, parameter localization, and optimization, but does not present these steps in a structured pseudocode or algorithm block.
Open Source Code	No	All datasets utilized in this study are publicly available and our code will be made publicly available upon acceptance of the paper.
Open Datasets	Yes	All datasets utilized in this study are publicly available and our code will be made publicly available upon acceptance of the paper. Experiments are conducted on three downstream tasks: SST-2 for sentiment classification [75], CoLA for grammatical acceptability judgment [90], and QNLI for question answering [69]. We evaluate our approach on five representative NLG benchmarks: Natural Questions (NQ) [40], SciQ [92], TriviaQA [32], TruthfulQA [47], and WikiQA [100]. The OOD evaluation is conducted on the IMDB-Cont [21] and IMDB-CAD [37] datasets. The OOD datasets for MNLI comprise HANS [99] and AdvNLI [63], while PAWS-QQP [104] serves as the OOD dataset for QQP.
Dataset Splits	No	Due to space limitations, further details regarding the datasets, the compared baselines, and the experimental settings are provided in the Appendix. The main text references standard benchmarks but does not explicitly state the specific training/test/validation splits used for these datasets within the visible text.
Hardware Specification	No	Details regarding the computational resources are provided in Section 4 and the Appendix. However, the main body of the paper does not contain specific hardware details like GPU/CPU models or memory amounts.
Software Dependencies	No	The paper mentions various LLM models used as backbones (e.g., LLaMA-3-70B [23], GPT4o [30], BERT-base [17], ALBERT-large [41], RoBERTa-base [51], LLaMA-2-7B [81], GPT-2 XL (1.5B) [67], GPT-J (6B) [86], LLaMA-7B [80], LLaMA-30B, LLaMA-2-13B, LLaMA-3-8B [23], and Vicuna-13B [13]) but does not provide specific software dependencies with version numbers like programming language versions or library versions (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup	Yes	The hyperparameter $\epsilon$ serves as a small constant that limits the extent of permissible ratio variation. Moreover, $A_t$ denotes the advantage estimation for token $t$, computed as $A_t = r(x, y) - \alpha \sum_{i=t} \log \frac{\pi_{\theta_c^{\mathrm{old}}\,\theta^{\mathrm{old}}}(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})} + \gamma\, B\, \mathrm{rank}(r(x, y))$, where $\alpha$ and $\gamma$ are two hyperparameters. Accordingly, the reward employed during optimization is defined as a weighted sum of the four reward components: $r = r_a + \lambda(r_o\, r_c + r_f)$, where the value of $\lambda$ is fixed as 0.5 in our experiments to maintain the relative dominance of the accuracy-related reward component.
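
The Experiment Setup excerpt describes ϵ as a constant that limits how far the policy probability ratio may move. A common way to apply such a bound is the PPO-style clipped surrogate, sketched below for a single token. This is an illustration under assumptions, not the paper's objective: the function name `clipped_surrogate`, the default `epsilon=0.2`, and the use of plain floats are all hypothetical, and the excerpt does not state the value of ϵ or the full loss.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token (illustrative sketch).

    ratio     -- new-policy probability divided by old-policy probability
    advantage -- advantage estimate A_t for the token
    epsilon   -- clipping range; 0.2 is an assumed default, not from the paper
    """
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + ϵ, so a single update cannot be rewarded for moving the policy arbitrarily far. With a negative advantage, the pessimistic `min` keeps the unclipped penalty when the ratio is large.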