Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Breaking the Gradient Barrier: Unveiling Large Language Models for Strategic Classification

Authors: Xinpeng Lv, Yunxin Mao, Haoxuan Li, Ke Liang, Jinxuan Yang, Wanrong Huang, Haoang Chi, Huan Chen, Long Lan, Yuanlong Chen, Wenjing Yang, Haotian Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We validate our approach through experiments with a collection of pre-trained LLMs on real-world and synthetic datasets in financial and internet domains, demonstrating that our GLIM exhibits both robustness and efficiency, and offering an effective solution for large-scale SC tasks.
Researcher Affiliation	Academia	Xinpeng Lv¹, Yunxin Mao¹, Haoxuan Li², Ke Liang¹, Jinxuan Yang³, Wanrong Huang¹, Haoang Chi¹, Huan Chen¹, Long Lan¹, Yuanlong Chen⁴, Wenjing Yang¹, Haotian Wang¹. ¹College of Computer Science and Technology, National University of Defense Technology, Changsha, China; ²Center for Data Science, Peking University, Beijing, China; ³Faculty of Engineering, the University of Sydney, Sydney, Australia; ⁴Faculty of Computing, Harbin Institute of Technology, Harbin, China
Pseudocode	No	The paper provides theoretical justifications and mathematical formulations (e.g., Lemma 1, Proposition 1, Proposition 2) but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code	Yes	We use LLM APIs and provide open access to our preprocessed data, intermediate datasets, and part of our code.
Open Datasets	Yes	We evaluate our method on six benchmark datasets, comprising five real-world datasets and one synthetic dataset: Large-scale datasets: CISFraud [63], a large-scale transactional dataset provided by IEEE and an international bank for fraud detection. PhiUSIIL [55], a phishing URL detection dataset reflecting adversarial evasion scenarios in cybersecurity. Synthetic [46], a synthetic dataset generated using the PaySim simulator, which mimics mobile financial transactions and fraud patterns based on real-world data. Small-scale datasets: Adult [4], a census dataset for predicting whether an individual's income. Spam [40], a text-based dataset for binary classification of email messages as spam or not. Credit [79], a credit scoring dataset used for predicting the risk of credit default in consumer finance scenarios.
Dataset Splits	Yes	Each method is subjected to 10-fold cross-validation, and the average results are presented in Table 2.
Hardware Specification	Yes	A key practical limitation of our approach is its reliance on proprietary large language models that are accessed via commercial APIs, such as OpenAI GPT-4o and DeepSeek. Unlike traditional machine learning models, which can be trained or deployed locally with a fixed hardware budget, our method depends on repeated calls to remote LLM API services.
Software Dependencies	No	For the baseline method, we employ a linear regression model as a reference classifier, optimizing it through gradient descent. In GLIM, we mainly utilize the pre-trained LLM APIs, e.g., GPT-4o [53], and refine its responses through in-context learning. We also conducted experiments on Claude [3], Mixtral [38], DeepSeek [45], Gemini [65], Qwen3 [13], and LLama [49]. Official SDKs (e.g., openai, anthropic, google-cloud-aiplatform) are used to interface with model APIs. However, specific version numbers for these APIs or SDKs are not provided.
Experiment Setup	Yes	Each method is subjected to 10-fold cross-validation, and the average results are presented in Table 2. Detailed implementation specifics are provided in Appendix J. All models are run with consistent hyperparameters across experiments. Prompt formatting is standardized to minimize variance due to stylistic differences in input-output formatting. Appendix J.3 provides extensive details on prompt design, task definition, in-context examples, and batch evaluation for the LLMs.
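
The Dataset Splits and Experiment Setup rows both refer to 10-fold cross-validation with averaged results. As a rough illustration of that evaluation protocol only, the following sketch splits a dataset into 10 folds, evaluates on each held-out fold, and averages the scores. It is not the authors' code: the function names, the fixed seed, and the majority-label stand-in model are all hypothetical, and the paper's actual splitting and scoring details may differ.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices once and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Return the mean held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)


# Hypothetical stand-in model (an assumption, not from the paper):
# always predict the most frequent training label.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)


def predict_majority(model, x):
    return model


if __name__ == "__main__":
    xs = [[float(i)] for i in range(100)]
    ys = [0] * 70 + [1] * 30
    print(cross_validate(xs, ys, fit_majority, predict_majority))
```

Any classifier can be substituted by passing different `fit` and `predict` callables; the fold construction and averaging stay the same.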