Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Genetic Prompt Search via Exploiting Language Model Probabilities
Authors: Jiangjiang Zhao, Zhuoran Wang, Fangchun Yang
IJCAI 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on diverse benchmark datasets show that the proposed precondition-free method significantly outperforms the existing DFO-style counterparts that require preconditions, including black-box tuning, genetic prompt search and gradient-free instructional prompt search. |
| Researcher Affiliation | Collaboration | Jiangjiang Zhao (1, 2), Zhuoran Wang (3), Fangchun Yang (1). (1) Beijing University of Posts and Telecommunications, P.R. China; (2) China Mobile Online Services Co., Ltd., Beijing, P.R. China; (3) Clouchie Limited, London, United Kingdom |
| Pseudocode | Yes | Algorithm 1 gives the pseudo-code of the proposed GAP3, where hyperparameters and constant objects are denoted in italic type. |
| Open Source Code | Yes | Code and supplementary material available at: https://github.com/zjjhit/gap3 |
| Open Datasets | Yes | The datasets used in the main experiments consist of 7 benchmark NLP tasks, which are the same as in [Sun et al., 2022b], including Yelp polarity, AG's News and DBPedia from [Zhang et al., 2015], SST-2, MRPC and RTE from the GLUE benchmarks [Wang et al., 2018], as well as SNLI [Bowman et al., 2015]. |
| Dataset Splits | No | The paper describes the creation of k-shot training sets and the use of original test sets or development sets as test sets, but does not explicitly define a separate validation set for the main model training. |
| Hardware Specification | No | The paper mentions 'computing power' in the acknowledgements but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions the use of various pretrained language models and optimizers but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We set GAP3's population size N = 64 and iteration number M = 50, with crossover and mutation probabilities ρc = 0.5 and ρm = 0.75, respectively. For PT, with learning rate 5e-4 and batch size 16, it runs for 1000 epochs. For full-model FT, with the same batch size, but learning rate 1e-5, we run it for 200 epochs. |
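
To show where the hyperparameters in the Experiment Setup row would enter a search loop, here is a minimal sketch of a generic genetic algorithm using N = 64, M = 50, ρc = 0.5 and ρm = 0.75. It is not the paper's Algorithm 1 and does not use language model probabilities: the vocabulary, the target string, `toy_fitness`, the elitist selection scheme, and all function names are hypothetical stand-ins chosen only to make the example runnable.

```python
import random

# Hyperparameters quoted in the Experiment Setup row.
POPULATION_SIZE = 64   # N
NUM_ITERATIONS = 50    # M
CROSSOVER_PROB = 0.5   # rho_c
MUTATION_PROB = 0.75   # rho_m

# Hypothetical toy problem: evolve a character sequence toward TARGET.
VOCAB = list("abcdefghijklmnopqrstuvwxyz")
TARGET = list("prompt")


def toy_fitness(candidate):
    """Count positions that match TARGET (a stand-in for a task score)."""
    return sum(1 for a, b in zip(candidate, TARGET) if a == b)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover, applied with probability CROSSOVER_PROB."""
    if rng.random() >= CROSSOVER_PROB:
        return list(parent_a)
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(candidate, rng):
    """Replace one random token, applied with probability MUTATION_PROB."""
    child = list(candidate)
    if rng.random() < MUTATION_PROB:
        child[rng.randrange(len(child))] = rng.choice(VOCAB)
    return child


def run_search(seed=0):
    """Run the toy search; return the best candidate and per-iteration best scores."""
    rng = random.Random(seed)
    population = [
        [rng.choice(VOCAB) for _ in TARGET] for _ in range(POPULATION_SIZE)
    ]
    history = []
    for _ in range(NUM_ITERATIONS):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        history.append(toy_fitness(ranked[0]))
        # Assumed selection scheme: keep the top half, refill with offspring.
        elite = ranked[: POPULATION_SIZE // 2]
        children = []
        while len(children) < POPULATION_SIZE - len(elite):
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = elite + children
    best = max(population, key=toy_fitness)
    return best, history


if __name__ == "__main__":
    best, history = run_search(seed=0)
    print("best candidate:", "".join(best))
    print("best score per iteration:", history)
```

Because the top half of each generation is carried over unchanged, the best score recorded in `history` never decreases. Replacing `toy_fitness` with a real task score and the token operators with prompt-level edits would be the natural next step, but how GAP3 does this is defined in the paper and its repository, not here.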