Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

System Prompt Optimization with Meta-Learning

Authors: Yumin Choi, Jinheon Baek, Sung Ju Hwang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.
Researcher Affiliation	Collaboration	KAIST, DeepAuto.ai
Pseudocode	Yes	We present the MetaSPO algorithm, which alternates between an Inner Loop and an Outer Loop. Algorithm 1 (MetaSPO). Input: task distribution T, initial system prompt s, number of iterations N. Output: optimized system prompt s*. 1: U_i ← {u_i} for each task T_i ∈ T ▷ Initialize user prompt set 2: for N iterations do 3: for each task T_i ∈ T do 4: U_i ← INNERLOOP(s, U_i, T_i) 5: end for 6: U ← {U_i | T_i ∈ T} 7: s ← OUTERLOOP(s, U, T) 8: end for 9: Return s* ← s. Algorithm 2 (INNERLOOP). Input: task T_i, system prompt s, set of user prompts U_i, number of new candidates m, top-k size k. Output: optimized user prompts U_i*. 1: u_0 ← arg max_{u ∈ U_i} E_{(q,a)∼T_i}[f(LLM(s, u, q), a)] ▷ Select the best-performing user prompt 2: for m iterations do 3: W_i ← {(q, a) | (q, a) ∈ T_i, LLM(s, u_0, q) ≠ a} ▷ Collect incorrect responses 4: A_i ← Analyzer(s, u_0, W_i) ▷ Analyze the current user prompt, Table 6 5: u ← Generator(s, u_0, A_i) ▷ Generate a candidate user prompt, Table 7 6: U_i ← U_i ∪ {u} 7: end for 8: U_i* ← arg max_{U_i' ⊆ U_i, |U_i'| = k} E_{(q,a)∼T_i} E_{u∼U_i'}[f(LLM(s, u, q), a)] ▷ Select top-k user prompts 9: Return U_i*. Algorithm 3 (OUTERLOOP). Input: task distribution T, system prompt s, set of user prompt sets U, number of new candidates m. Output: optimized system prompt s*. 1: s_0 ← s ▷ Initialize the system prompt 2: S ← {s_0} ▷ Initialize the system prompt set 3: for m iterations do 4: W ← ∅ 5: for each task T_i ∈ T do 6: u_i ← arg max_{u ∈ U_i} E_{(q,a)∼T_i}[f(LLM(s_0, u, q), a)] ▷ Select the best-performing user prompt 7: W_i ← {(q, a) | (q, a) ∈ T_i, LLM(s_0, u_i, q) ≠ a} ▷ Collect incorrect responses 8: W ← W ∪ W_i 9: end for 10: A ← Analyzer(s_0, W) ▷ Analyze the current system prompt, Table 8 11: s ← Generator(s_0, A) ▷ Generate a candidate system prompt, Table 9 12: S ← S ∪ {s} 13: end for 14: s* ← arg max_{s ∈ S} E_{T_i∼T, (q,a)∼T_i}[E_{u∼U_i}[f(LLM(s, u, q), a)]] ▷ Evaluate across tasks and user prompts 15: Return s*
Open Source Code	Yes	Code is available at https://github.com/Dozi01/MetaSPO.
Open Datasets	Yes	To extensively evaluate the efficacy of the (optimized) system prompts, our evaluation suite spans 5 distinct domains (over 34 different tasks), as follows: Medical which aims to answer medical-related queries [26]; Review Analysis which aims to predict the sentiment of customer reviews [15]; Reasoning which evaluates the logical and analytical thinking of models [7]; Grounding which assesses whether the generated responses are grounded in the provided context [29]; Safety which measures the ability to classify harmful or sensitive content [5].
Dataset Splits	Yes	For data splits, we use predefined train-test splits from datasets (if available), such as MedMCQA, BigBench, WebQA, and Anthropic Harmless. For datasets without predefined splits, such as Amazon, Natural Questions, and Ethos, we randomly divide the training data to create the test sets. Also, for each task, 50 training samples are randomly selected using different seeds across three experimental runs. As summarized in Table 5, the number of test samples is limited to a maximum of 500 in all tasks to reduce the computational burden.
Hardware Specification	Yes	Experiments are primarily conducted using an NVIDIA A5000 GPU.
Software Dependencies	Yes	For a fair comparison of different approaches, we primarily use Llama 3.2 (3B) [11] as the base model for generating responses, and GPT-4o mini as the prompt optimizer.
Experiment Setup	Yes	For our MetaSPO, we iterate the inner and outer loops three times. Also, in system prompt optimization, we generate nine different prompts and maintain one, while for the user prompt, we generate and maintain three. During the problem analysis step for the current prompts, we use three incorrect examples for the user prompt optimization, and two incorrect examples per task (aggregated over all tasks) for the system prompt optimization. All experiments are conducted with three different runs, and we report their average results. For more details, please refer to A.2.
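
The bilevel procedure quoted in the Pseudocode row can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `meta_spo`, `inner_loop`, `outer_loop`, `mean_score`, `toy_score`, `propose_user`, and `propose_system` are hypothetical, the Analyzer and Generator steps are merged into a single "propose" callable, and the LLM call plus the metric f are replaced by a deterministic toy scorer so the example runs offline. The defaults (three outer iterations, three user-prompt candidates with top-3 kept, nine system-prompt candidates with one kept) follow the Experiment Setup row. Unlike the paper, the sketch passes all incorrect examples to the proposer rather than a fixed handful.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (question, answer)
Task = List[Example]
# Hypothetical stand-in for f(LLM(s, u, q), a): returns a score in [0, 1].
Scorer = Callable[[str, str, str, str], float]


def mean_score(score: Scorer, s: str, u: str, task: Task) -> float:
    """Average score of system prompt s and user prompt u over one task."""
    return sum(score(s, u, q, a) for q, a in task) / len(task)


def inner_loop(s, users, task, score, propose_user, m=3, k=3):
    """Propose m user-prompt candidates from the current best, keep the top k."""
    u0 = max(users, key=lambda u: mean_score(score, s, u, task))
    pool = list(users)
    for _ in range(m):
        wrong = [(q, a) for q, a in task if score(s, u0, q, a) < 1.0]
        pool.append(propose_user(s, u0, wrong))  # Analyzer + Generator merged
    pool = list(dict.fromkeys(pool))  # drop duplicates, keep order
    return sorted(pool, key=lambda u: mean_score(score, s, u, task), reverse=True)[:k]


def outer_loop(s, users_by_task, tasks, score, propose_system, m=9):
    """Propose m system-prompt candidates, keep the best across all tasks."""
    s0 = s
    candidates = [s0]
    for _ in range(m):
        failures: List[Example] = []
        for name, task in tasks.items():
            ui = max(users_by_task[name], key=lambda u: mean_score(score, s0, u, task))
            failures += [(q, a) for q, a in task if score(s0, ui, q, a) < 1.0]
        candidates.append(propose_system(s0, failures))

    def overall(cand: str) -> float:
        # Mean over tasks of the mean over that task's user prompts.
        per_task = [
            sum(mean_score(score, cand, u, task) for u in users_by_task[name])
            / len(users_by_task[name])
            for name, task in tasks.items()
        ]
        return sum(per_task) / len(per_task)

    return max(candidates, key=overall)


def meta_spo(tasks, s, init_user, score, propose_user, propose_system, n_iters=3):
    """Alternate inner (user prompt) and outer (system prompt) updates."""
    users: Dict[str, List[str]] = {name: [init_user] for name in tasks}
    for _ in range(n_iters):
        for name, task in tasks.items():
            users[name] = inner_loop(s, users[name], task, score, propose_user)
        s = outer_loop(s, users, tasks, score, propose_system)
    return s, users


# Toy stand-ins (assumptions, not real LLM calls): a "hard" question needs
# both a system-level hint and a user-level hint to be answered correctly.
def toy_score(s: str, u: str, q: str, a: str) -> float:
    hints = int("verify" in s) + int("step" in u)
    need = 2 if q.startswith("hard") else 1
    return 1.0 if hints >= need else 0.0


def propose_user(s: str, u0: str, wrong: List[Example]) -> str:
    return u0 + " Think step by step." if wrong else u0


def propose_system(s0: str, failures: List[Example]) -> str:
    return s0 + " Always verify the answer." if failures else s0


tasks = {
    "math": [("easy: 1+1", "2"), ("hard: 17*3", "51")],
    "qa": [("hard: capital of France", "Paris")],
}
best_system, best_users = meta_spo(
    tasks, "You are a helpful assistant.", "Answer the question.",
    toy_score, propose_user, propose_system,
)
print(best_system)
```

In a real setting, `toy_score` would wrap a call to the base model plus the task metric, and the two proposers would wrap the optimizer model prompted with the analysis and generation templates the paper refers to in Tables 6 through 9.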