PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine

Authors: Chenrui Zhang, Lin Liu, Chuyuan Wang, Xiao Sun, Hongyu Wang, Jinpeng Wang, Mingchen Cai

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin."
Researcher Affiliation | Collaboration | Meituan Inc., Beijing, China; School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
Pseudocode | Yes | "Algorithm 1: Our PREFER Algorithm"
Open Source Code | Yes | "We have made our code publicly available. Our implementation is available online." https://github.com/zcrwind/PREFER
Open Datasets | Yes | "We follow the experimental settings of the compared works to conduct experiments on a wide range of tasks including natural language inference and classification. Natural Language Inference: SNLI (Bowman et al. 2015), MNLI (Williams, Nangia, and Bowman 2017), and RTE (Dagan, Glickman, and Magnini 2005) for textual entailment inference; QNLI (Rajpurkar et al. 2016) for question-answering inference. Natural Language Classification: Ethos (Mollas et al. 2020) for hate speech detection; Liar (Wang 2017) for fake news classification; ArSarcasm (Farha and Magdy 2020) for Arabic sarcasm detection."
Dataset Splits | No | "To make a fair comparison, we closely follow the experimental protocols that were set up in APO with our own data split. In detail, we mainly conduct developing and evaluation of our PREFER in few-shot settings. For each task, we randomly sample k examples from the original training dataset, to build k-shot training set Dtr. By default, the k in this paper is set to 50." The paper mentions a "data split" and a "k-shot training set" but does not give specific counts or percentages for the training, validation, and test sets.
Hardware Specification | No | The paper does not specify the hardware used for its experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using "GPT-3.5-turbo as M" but does not list software dependencies or version numbers (e.g., Python, PyTorch, or other specific libraries).
Experiment Setup | Yes | "In detail, we mainly conduct developing and evaluation of our PREFER in few-shot settings. For each task, we randomly sample k examples from the original training dataset, to build k-shot training set Dtr. By default, the k in this paper is set to 50."
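The few-shot protocol quoted above (randomly sampling k examples from the full training set to build the k-shot set Dtr, with k = 50 by default) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name, the seed parameter, and the toy dataset are assumptions made for the example.

```python
import random

def build_k_shot_split(train_examples, k=50, seed=0):
    """Randomly sample k examples from the original training set to
    form the k-shot training set Dtr (k = 50 by default, per the paper)."""
    if k > len(train_examples):
        raise ValueError("k exceeds the number of available training examples")
    rng = random.Random(seed)  # fixed seed so the sampled split is reproducible
    return rng.sample(train_examples, k)

# Illustrative usage with a toy dataset of (text, label) pairs.
full_train = [(f"example {i}", i % 2) for i in range(1000)]
d_tr = build_k_shot_split(full_train, k=50)
print(len(d_tr))  # 50
```

Fixing the random seed is one way to make such a sampled split reproducible; the paper itself does not state how (or whether) its sampling was seeded.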