Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

metaTextGrad: Automatically optimizing language model optimizers

Authors: Guowei Xu, Mert Yuksekgonul, Carlos Guestrin, James Y Zou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conducted experiments on multiple benchmarks, and the results demonstrate that our meta-optimized optimizers consistently outperform the existing ones. Overall, we summarize our contributions as follows. First, we introduce the concept of meta-optimization, highlighting that existing LLM optimizers often require further task alignment and effective combination through a meta-optimizer to maximize their potential. Lastly, experimental results on multiple benchmarks demonstrate that our method significantly outperforms baseline approaches in both performance and generalization.
Researcher Affiliation	Academia	Guowei Xu (Tsinghua University); Mert Yuksekgonul (Stanford University); Carlos Guestrin (Stanford University); James Zou (Stanford University)
Pseudocode	Yes	Algorithm 1 Inner Loop: Optimize Φ with Optimizer M; Algorithm 2 Meta-Optimization of Optimizers; Algorithm 3 Meta Prompt Optimizer; Algorithm 4 Meta Structure Optimizer
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide open code access.
Open Datasets	Yes	We evaluate our approach and the baselines on multiple benchmarks, including BBH [13, 14], MMLU [15], and GPQA [16].
Dataset Splits	Yes	The BBH Word Sorting task requires the language model to sort a given set of words in order, while the BBH Dyck Languages task involves providing a string composed of various types of brackets and asking the model to determine the characters needed to complete the bracket pairing. We use the same train/validation/test splits as in TextGrad [1] (i.e., 50, 100, and 100 instances for training, validation, and testing) and follow the TextGrad approach by using GPT-4o to evaluate whether the model's output is correct, based on the predicted and ground truth answers. MMLU. We selected the MMLU Abstract Algebra dataset and created training, validation, and test sets consisting of 10, 50, and 40 questions, respectively. GPQA Diamond. The benchmark consists of 198 questions, which we split into training, validation, and test datasets containing 30, 100, and 68 questions.
Hardware Specification	No	The API is the only compute resource we used. We report the details of API usage in Section 4. We use GPT-4o-mini for LLM calls within the program, GPT-4o for the MIPROv2 and TGD optimizers, and the o1 model for the structure optimizer and the meta-optimizers.
Software Dependencies	No	The paper mentions using specific LLM models (GPT-4o-mini, GPT-4o, o1 model) and libraries like `textgrad`, `typing`, `collections`, `inspect`, `copy`, but it does not provide specific version numbers for any of these software dependencies.
Experiment Setup	Yes	To ensure reproducibility, we provide the prompts and structures of the learned programs in Appendix E. In our experiments, we use GPT-4o-mini for LLM calls within the program, GPT-4o for the MIPROv2 and TGD optimizers, and the o1 model for the structure optimizer and the meta-optimizers. Due to the non-determinism of LLM APIs [21], the test accuracy for each benchmark is averaged over five random seeds. Although we allocate six optimization steps per training epoch, we observe that meta-optimized optimizers, due to their stronger task alignment, often achieve significant improvements within the first 1-2 steps.
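
The split sizes and seed averaging quoted in the table can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `split_dataset` and `mean_accuracy`, the seeded shuffle, and the placeholder question IDs are all assumptions made for illustration. The quoted text does not say how the questions were assigned to each split. Only the sizes (30/100/68 of 198 GPQA Diamond questions) and the five-seed averaging come from the quoted text.

```python
import random
from statistics import mean


def split_dataset(items, n_train, n_val, n_test, seed=0):
    """Shuffle with a fixed seed, then cut into train/validation/test.

    Assumption: a seeded random shuffle; the source does not state how
    questions were assigned to splits.
    """
    if n_train + n_val + n_test > len(items):
        raise ValueError("split sizes exceed dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


def mean_accuracy(run_fn, seeds):
    """Average the test accuracy returned by run_fn(seed) over all seeds."""
    return mean(run_fn(seed) for seed in seeds)


# Placeholder IDs standing in for the 198 GPQA Diamond questions.
questions = list(range(198))
train, val, test = split_dataset(questions, n_train=30, n_val=100, n_test=68)

# Hypothetical evaluation function; a real one would call the LLM API.
def fake_run(seed):
    return 0.5

avg = mean_accuracy(fake_run, seeds=range(5))
```

Fixing the shuffle seed keeps the splits identical across runs, so the only remaining run-to-run variation is the API non-determinism that the five-seed average is meant to smooth out.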