Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Understanding prompt engineering may not require rethinking generalization
Authors: Victor Akinwande, Yiding Jiang, Dylan Sam, J Zico Kolter
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate empirically that this holds for existing handcrafted prompts and prompts generated through simple greedy search. ... 5 EXPERIMENTS In this section, we evaluate the generalization of discrete prompts generated by Greedy on CIFAR10, CIFAR-100, ImageNet as well as domain generalization datasets fMoW (Christie et al., 2018) and OfficeHome (Venkateswara et al., 2017), which is much less studied in the context of numerical generalization bounds. |
| Researcher Affiliation | Collaboration | Victor Akinwande1, Yiding Jiang1, Dylan Sam1 & J. Zico Kolter1,2 1Carnegie Mellon University, 2Bosch Center for AI |
| Pseudocode | Yes | A PSEUDOCODE Algorithm 1 Sequential Prompt Search |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described is openly available. |
| Open Datasets | Yes | In this section, we evaluate the generalization of discrete prompts generated by Greedy on CIFAR10, CIFAR-100, ImageNet as well as domain generalization datasets fMoW (Christie et al., 2018) and OfficeHome (Venkateswara et al., 2017) |
| Dataset Splits | No | The paper describes using a 'split portion of the dataset s ∈ {0.1, ..., 1.0}' for its experiments and mentions training and testing data, but it does not explicitly define or specify a separate 'validation' dataset split with percentages or counts for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like CLIP and LLaMA-7B (Touvron et al., 2023), but it does not provide specific version numbers for multiple key software libraries, frameworks, or programming languages used to run the experiments. |
| Experiment Setup | Yes | C EXPERIMENTAL DETAILS. Hyperparameters: We report the hyperparameters used in CLIP, LLaMA-7b, and the Greedy algorithm in Table 4. Table 4: Hyperparameters used in CLIP, LLaMA-7b and Greedy. Batch size: 100; CLIP vocabulary size: 49,408; LLaMA-7B vocabulary size: 32,000; Temperature: 1.0; Bound δ: 0.01; SRM β: 1.0. |
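
For readers attempting a reproduction, the Table 4 values quoted above can be collected into a single configuration object. The following is a minimal sketch, not the authors' code: the class name `GreedyConfig`, its field names, and the helper `split_portions` are hypothetical. The 0.1 step between split portions is also an assumption, since the quoted text gives only the endpoints 0.1 and 1.0.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GreedyConfig:
    """Values quoted from Table 4 of the paper; field names are hypothetical."""
    batch_size: int = 100
    clip_vocab_size: int = 49_408
    llama_vocab_size: int = 32_000
    temperature: float = 1.0
    bound_delta: float = 0.01  # "Bound δ" in Table 4
    srm_beta: float = 1.0      # "SRM β" in Table 4


def split_portions(n_steps: int = 10) -> list[float]:
    """Dataset portions s from 0.1 to 1.0.

    The uniform 0.1 spacing is an assumption; the report quotes only
    the set {0.1, ..., 1.0}.
    """
    return [round(i / n_steps, 1) for i in range(1, n_steps + 1)]


if __name__ == "__main__":
    print(asdict(GreedyConfig()))
    print(split_portions())
```

Freezing the dataclass keeps the recorded values immutable, so a logged configuration cannot drift from what was run.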