Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search
Authors: Yanbo Wang, Zixiang Xu, Yue Huang, Gao Chujie, Siyuan Wu, Jiayi Ye, Pin-Yu Chen, Xiuying Chen, Xiangliang Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive experiments to validate our framework, evaluating its effectiveness across four benchmark datasets, namely MMLU, Commonsense QA, Openbook QA, and Truthful QA, on a diverse set of mainstream models, including proprietary and open-weight architectures. Our results show that adaptive distractions cause significant performance degradation, with an average accuracy drop exceeding 45%, exposing vulnerabilities in even the most advanced LLMs. Experiments on four benchmarks demonstrate that the generated distractions lead to an average performance drop of over 45% for mainstream models. |
| Researcher Affiliation | Collaboration | 1Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) 2University of Notre Dame 3IBM Research |
| Pseudocode | Yes | We show the overall algorithm in Algorithm 1. |
| Open Source Code | Yes | The code is publicly available at https://github.com/wyf23187/Adaptive_Distractions. |
| Open Datasets | Yes | We selected four widely used benchmarks to evaluate contextual robustness under adaptive distraction: MMLU [16, 17], Commonsense QA [18], Openbook QA [19], and Truthful QA [20]. |
| Dataset Splits | Yes | We used 1200 original questions from 4 datasets, splitting them into training, test, and validation sets. Specifically, 80 percent of the data was allocated to training, with 10 percent of the training set reserved for validation, and the remaining 10 percent was used for testing. ... The data was split into training, validation, and test sets, with 80 percent of the data used for training, 10 percent of the training set reserved for validation, and 20 percent allocated to testing. |
| Hardware Specification | Yes | The training was conducted on a single RTX 4090 GPU, with a learning rate set to 1e-4 and a total of five epochs. ... the fine-tuning was performed on two RTX 4090 GPUs with a learning rate set to 2e-4 and five epochs. |
| Software Dependencies | No | The paper does not explicitly provide specific version numbers for ancillary software components (e.g., Python, PyTorch, CUDA) used in the experiments. |
| Experiment Setup | Yes | We set the temperature to 0.7 during the distraction generation phase to encourage more diverse and challenging outputs. For evaluation, we lowered the temperature to 0.001 to ensure response consistency, with a maximum output length of 1,024 tokens. Additionally, we set α = 2 and γ = 1 for the value function used in the tree search. For other detailed hyperparameter settings, please refer to Appendix B.1. ... The training was conducted on a single RTX 4090 GPU, with a learning rate set to 1e-4 and a total of five epochs. ... the fine-tuning was performed on two RTX 4090 GPUs with a learning rate set to 2e-4 and five epochs. The preference loss was implemented with a sigmoid activation function. |
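
The two split descriptions quoted in the Dataset Splits row give different test fractions (10 percent and 20 percent). The sketch below works through the arithmetic of each reading. It assumes that the 1200-question total applies to both descriptions, which the excerpt does not confirm since the two passages may describe different experiments. The function name `split_sizes` and its parameters are hypothetical and are not taken from the paper's code.

```python
def split_sizes(total, train_pct, val_pct_of_train, test_pct):
    """Return example counts for a train/validation/test split.

    Percentages are integers so the arithmetic stays exact.
    The validation set is carved out of the training portion,
    as both quoted descriptions state.
    """
    train_pool = total * train_pct // 100          # portion allocated to training
    val = train_pool * val_pct_of_train // 100     # validation taken from that portion
    test = total * test_pct // 100                 # held-out test portion
    return {
        "train": train_pool - val,
        "validation": val,
        "test": test,
        # Examples covered by neither the training portion nor the test portion.
        "unassigned": total - train_pool - test,
    }


# Assumption: 1200 questions in total for both readings.
first_reading = split_sizes(1200, 80, 10, 10)   # "remaining 10 percent" for testing
second_reading = split_sizes(1200, 80, 10, 20)  # "20 percent allocated to testing"

print(first_reading)
print(second_reading)
```

Under this assumption, only the 20 percent test fraction accounts for every question; the 10 percent reading leaves a tenth of the data unassigned. Which of the two the authors actually used cannot be determined from the excerpt alone.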