Quality-Diversity through AI Feedback

Authors: Herbie Bradley, Andrew Dai, Hannah Benita Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Gregory Schott, Joel Lehman

ICLR 2024

Reproducibility variables, assessed results, and supporting LLM responses:

Research Type: Experimental
When assessed on creative writing domains, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities.

Researcher Affiliation: Collaboration
1 Carper AI; 2 CAML Lab, University of Cambridge; 3 EleutherAI; 4 Aleph Alpha; 5 Department of Computer Science, University of British Columbia; 6 Vector Institute; 7 Stability AI; 8 Canada CIFAR AI Chair; 9 Maven; 10 Stochastic Labs

Pseudocode: No
The paper describes the MAP-Elites algorithm and its extensions verbally and with a diagram, but it does not provide any formal pseudocode blocks or algorithms.

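Since no formal pseudocode is given, here is a minimal, self-contained sketch of the MAP-Elites loop the paper describes, using the quoted defaults (50 initialization iterations, 2000 total iterations, 20 bins). The `generate_candidate`, `evaluate_fitness`, and `evaluate_diversity` functions below are toy numeric stand-ins for illustration only; in QDAIF they would be LM calls (LMX few-shot mutation for generation, AI feedback for the two scores), and the single-solution-per-bin archive is an assumption, not the authors' implementation.

```python
import random

# Toy stand-ins, for illustration only: in QDAIF these would be LM calls
# (LMX few-shot mutation for generation, AI feedback for the two scores).
def generate_candidate(parent=None):
    if parent is None:
        return random.random()
    return min(max(parent + random.gauss(0.0, 0.1), 0.0), 1.0)

def evaluate_fitness(candidate):
    return 1.0 - abs(candidate - 0.5)   # toy quality score in [0, 1]

def evaluate_diversity(candidate):
    return candidate                    # toy diversity measure in [0, 1]

def map_elites(init_iters=50, total_iters=2000, num_bins=20):
    """Minimal MAP-Elites loop: the archive maps a diversity bin to its elite."""
    archive = {}  # bin index -> (solution, fitness)
    for step in range(total_iters):
        if step < init_iters or not archive:
            candidate = generate_candidate()                   # initialization phase
        else:
            parent, _ = random.choice(list(archive.values()))  # pick a random elite
            candidate = generate_candidate(parent)             # mutate it
        fitness = evaluate_fitness(candidate)
        measure = evaluate_diversity(candidate)
        bin_idx = min(int(measure * num_bins), num_bins - 1)
        # Keep the candidate only if its bin is empty or it beats the incumbent elite.
        if bin_idx not in archive or fitness > archive[bin_idx][1]:
            archive[bin_idx] = (candidate, fitness)
    return archive

if __name__ == "__main__":
    elites = map_elites()
    print(f"Filled {len(elites)} of 20 bins")
```

In the 2D-archive experiments (the 5000-iteration setting quoted below), the single bin index would become a pair of indices, one per diversity measure.
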
Open Source Code: Yes
Project Page: https://qdaif.github.io/

Open Datasets: Yes
We based the domain problem on a task from the Human Eval benchmark (Chen et al., 2021), specifically problem number 88, where the aim is to implement a sorting algorithm that is conditional on the properties of an unsorted list of non-negative integers.

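For reference, a short sketch of the kind of conditional sorting task the quote refers to. The specific rule used below (sort ascending when the sum of the first and last elements is odd, descending when it is even) is taken from the public HumanEval `sort_array` problem and is an assumption here, not a quotation from the paper.

```python
def sort_array(array):
    """Conditional sort in the style of HumanEval problem 88 (rule assumed, see above):
    ascending if the sum of the first and last elements is odd, descending if even."""
    if not array:
        return []
    ascending = (array[0] + array[-1]) % 2 == 1
    return sorted(array, reverse=not ascending)

# first + last = 1 + 5 = 6 (even) -> descending order
assert sort_array([1, 3, 2, 5]) == [5, 3, 2, 1]
```
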
Dataset Splits: No
The paper does not specify explicit training/validation/test dataset splits (e.g., percentages or counts) for the creative writing generation experiments, as the data is dynamically generated. While pre-trained LMs are used, their training splits are not detailed in this paper. Human evaluation is conducted as part of the assessment, but it is not framed as a validation set split.

Hardware Specification: No
The paper mentions using models of different sizes (13B, 30B, 70B parameters) but does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or memory) used to run these models or the experiments.

Software Dependencies: Yes
To generate the text outputs for experiments in the Opinions and Stories domains, we used luminous-base, an autoregressive, causal, decoder-only transformer model (...) A model card is provided for additional specifications on the models. (...) We finetuned a 70B model (...) on datasets and prompts from FLAN (Wei et al., 2021), Super-Natural Instructions (Wang et al., 2022b), P3 (Sanh et al., 2021), and chain-of-thought datasets inspired by the approach of Chung et al. (2022) (...) This approach resulted in a model that performed relatively well on instruction-following tasks, especially for the classification of arbitrary measures of natural language texts.

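As an illustration of how an instruction-following LM can serve as AI feedback, the sketch below scores a text attribute from the relative next-token probability of "yes" versus "no". The Hugging Face `transformers` usage, the placeholder checkpoint, and the prompt wording are assumptions for illustration; the paper's luminous models are accessed through a different interface, not this code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint for the sketch; the paper's models (luminous-base and the
# finetuned 70B instruction model) are not what is loaded here.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def attribute_score(text: str, question: str) -> float:
    """Score one attribute of `text` as P(yes) / (P(yes) + P(no)) for the next token."""
    prompt = f"{question}\nText: {text}\nAnswer yes or no:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    p_yes, p_no = probs[yes_id].item(), probs[no_id].item()
    return p_yes / (p_yes + p_no)

# Hypothetical usage: one score for quality, one for a diversity attribute.
text = "Eating vegetables every day keeps meals interesting and healthy."
quality = attribute_score(text, "Is the following opinion piece well written?")
sentiment = attribute_score(text, "Is the following opinion positive about vegetables?")
```

Scores of this kind supply both the quality (fitness) signal and the diversity measure used to place solutions into archive bins in QDAIF.
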
Experiment Setup: Yes
Models and Setup. Details on the LMX generation model (Appendix A.24) and finetuned AI feedback model (Appendix A.25) are given, with details on the training of these LMs. Additional default hyperparameters are described in Appendix A.27. (...)

Appendix A.27, Default Hyperparameters for QDAIF with LMX:

Mutation Model Inference Setup:
- Model size: 13B (default, except for experiments on scaling model size)
- LM sampling softmax temperature: 0.8
- Number of few-shot examples used: 3
- Max output tokens limit (Opinions): 50
- Max output tokens limit (Stories): 100
- Stop sequence patterns: (...)

MAP-Elites Hyperparameters:
- Number of archive population initialization iterations: 50
- Number of total search iterations: 2000 (5000 for experiments using 2D archive)
- Iteration batch size: 1
- Number of bin intervals: 20
- Fitness function range: [0, 1]
- Bin tick intervals in the range [0, 1] (non-uniform): (...)
- 2D domain bin tick intervals in the range [0, 1] (non-uniform): (...)
- Archive bin depth limit: 100
- Prompt pool initial size (Zero-Shot Init): 10

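To make the quoted defaults concrete, the sketch below collects them into a configuration object and builds an LMX-style few-shot mutation prompt from three parent examples. The dataclass fields mirror the Appendix A.27 values quoted above, while the field names, the prompt template, and the instruction wording are illustrative assumptions rather than the authors' code.

```python
from dataclasses import dataclass

@dataclass
class QDAIFConfig:
    """Defaults quoted from Appendix A.27 (field names are illustrative, not official)."""
    # Mutation (LMX) model inference setup
    model_size: str = "13B"
    temperature: float = 0.8
    num_few_shot_examples: int = 3
    max_output_tokens_opinions: int = 50
    max_output_tokens_stories: int = 100
    # MAP-Elites setup
    init_iterations: int = 50
    total_iterations: int = 2000        # 5000 for the 2D-archive experiments
    batch_size: int = 1
    num_bins: int = 20
    fitness_range: tuple = (0.0, 1.0)
    archive_bin_depth_limit: int = 100
    prompt_pool_initial_size: int = 10  # Zero-Shot Init

def build_lmx_prompt(parents, header="Here is a random example of an opinion piece:"):
    """Assumed few-shot mutation prompt: list parent texts, then leave one block
    open for the model to complete with a new variant."""
    blocks = [f"{header}\n{p}" for p in parents]
    blocks.append(header)
    return "\n\n".join(blocks)

cfg = QDAIFConfig()
parents = ["Example opinion A.", "Example opinion B.", "Example opinion C."]
prompt = build_lmx_prompt(parents[: cfg.num_few_shot_examples])
```

Generation from such a prompt would then use the quoted sampling temperature of 0.8 and the per-domain output-token limits (50 for Opinions, 100 for Stories).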