Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs

Authors: ChangHao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, Bo Dai

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on diverse tasks demonstrate that our method effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks. Extensive experiments conducted on three complex tasks demonstrate the effectiveness and generalizability of M-Pilot in improving the advanced problem-solving capabilities of black-box LLMs, with an average improvement of 3.19% in accuracy for reasoning, 7.46% in success rate for planning, and 5.82% in accuracy for personalization.
Researcher Affiliation	Collaboration	Changhao Li¹, Yuchen Zhuang¹, Rushi Qiang¹, Haotian Sun¹, Hanjun Dai²,³, Chao Zhang¹, Bo Dai¹,³. Equal Contribution. ¹Georgia Institute of Technology, ²Precur AI, ³Google DeepMind
Pseudocode	Yes	Figure 3: Examples of intermediate guidance generated by M-Pilot for complex reasoning, planning, and personalization tasks. ... def solution(agent, start_from=1): ... # General plan: if start_from <= 1: # [Step 1] ... answer = ask(' ...') recep_to_check = literal_eval(answer) ... Appendix G.1: def solution(agent, start_from=1): ... # General plan: I need to get a list of receptacles to find the book and take the book with me, then I get another list of receptacles to find the desklamp and turn it on. # [Step 1] get a list of receptacles where a book is likely to appear.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code in supplementary materials.
Open Datasets	Yes	Tasks and Datasets. We consider three types of tasks in experiments, each targeting a distinct capability of black-box LLMs: (1) LaMP [34] for personalization capabilities, (2) GSM8K [10] for reasoning capabilities, and (3) ALFWorld [39] for planning capabilities.
Dataset Splits	Yes	For the AlfWorld dataset, the entire training set consists of 8,808 samples. ... For the GSM8K dataset, the full training set comprises 7,473 samples. ... Table 18: Dataset statistics of five different personalization tasks (LaMP-1, 2N, 2M, 3, and 4) from the LaMP benchmark [34]. Task Type # Train # Validation # Test Input Length Output Length # Profiles # Classes LaMP-1 Classification 9682 2500 2500 51.40 5.72 90.61 53.87 2
Hardware Specification	Yes	We conduct all black-box LLM enhancement experiments on CPU: AMD(R) EPYC(R) 7702 64-Core Processor@1.50GHz and GPU: NVIDIA A100-SXM4-80GB using Python 3.10.13. ... During the training phase, we used four H100 GPUs for two rounds of DPO training.
Software Dependencies	No	We conduct all black-box LLM enhancement experiments on CPU: AMD(R) EPYC(R) 7702 64-Core Processor@1.50GHz and GPU: NVIDIA A100-SXM4-80GB using Python 3.10.13. ... For the white-box LLM controller, we utilize LLaMA-3-8B-Instruct as the backbone language model, we also consider Qwen2.5-7B-Instruct as the backbone in Appendix B.5.
Experiment Setup	Yes	F.3.2 Hyperparameter Configurations We set the maximum sequence length for generated solutions to 512 tokens across all tasks and scenarios. The controller model is Llama-3-8B-Instruct, while the environment model is gpt-4o-mini for the primary tasks and gpt-3.5-turbo for specific ablation studies. ... During optimization, we train for two epochs per task using the following hyperparameters: LoRA rank to 8, LoRA α to 16, LoRA dropout to 0.05, learning rate to 1e-5, float type to bf16, max length to 8192, and label smoothing to 0.1.
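
For readers who want the reported setup in machine-readable form, the hyperparameters quoted in the Experiment Setup row can be collected into a small record. This is only a sketch: the class name `ReportedSetup` and all field names are hypothetical and are not taken from the authors' code; only the values come from the quoted text. The derived `lora_scaling` assumes the standard LoRA convention of scaling the update by α / rank.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted from Appendix F.3.2; field names are hypothetical."""
    controller_model: str = "Llama-3-8B-Instruct"
    environment_model: str = "gpt-4o-mini"               # primary tasks
    ablation_environment_model: str = "gpt-3.5-turbo"    # specific ablations
    max_generated_tokens: int = 512                      # generated solutions
    epochs_per_task: int = 2
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    learning_rate: float = 1e-5
    float_type: str = "bf16"
    max_length: int = 8192
    label_smoothing: float = 0.1


setup = ReportedSetup()

# Plain dict, e.g. for logging or for comparing against another run's settings.
config = asdict(setup)

# Standard LoRA scaling factor (alpha / rank); an assumption about convention,
# not something the quoted text states.
lora_scaling = setup.lora_alpha / setup.lora_rank

print(config)
print("LoRA scaling:", lora_scaling)
```

Keeping the record frozen makes accidental edits raise an error, which suits values that are meant to mirror a published configuration.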