PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine

Authors: Chenrui Zhang, Lin Liu, Chuyuan Wang, Xiao Sun, Hongyu Wang, Jinpeng Wang, Mingchen Cai

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin."
Researcher Affiliation | Collaboration | Meituan Inc., Beijing, China; School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
Pseudocode | Yes | "Algorithm 1: Our PREFER Algorithm"
Open Source Code | Yes | "We have made our code publicly available. Our implementation is available online." https://github.com/zcrwind/PREFER
Open Datasets | Yes | "We follow the experimental settings of the compared works to conduct experiments on a wide range of tasks including natural language inference and classification. Natural Language Inference: SNLI (Bowman et al. 2015), MNLI (Williams, Nangia, and Bowman 2017), and RTE (Dagan, Glickman, and Magnini 2005) for textual entailment inference; QNLI (Rajpurkar et al. 2016) for question-answering inference. Natural Language Classification: Ethos (Mollas et al. 2020) for hate speech detection; Liar (Wang 2017) for fake news classification; ArSarcasm (Farha and Magdy 2020) for Arabic sarcasm detection."
Dataset Splits | No | "To make a fair comparison, we closely follow the experimental protocols that were set up in APO with our own data split. In detail, we mainly conduct developing and evaluation of our PREFER in few-shot settings. For each task, we randomly sample k examples from the original training dataset, to build k-shot training set Dtr. By default, the k in this paper is set to 50." The paper mentions a "data split" and a "k-shot training set" but does not give specific counts or percentages for the training, validation, and test sets.
Hardware Specification | No | The paper does not specify the hardware used for its experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using "GPT-3.5-turbo as M" but does not list software dependencies or version numbers (e.g., Python, PyTorch, or other specific libraries).
Experiment Setup | Yes | "In detail, we mainly conduct developing and evaluation of our PREFER in few-shot settings. For each task, we randomly sample k examples from the original training dataset, to build k-shot training set Dtr. By default, the k in this paper is set to 50."
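The few-shot protocol quoted above (randomly sampling k examples from the full training set to build the k-shot set Dtr, with k = 50 by default) can be sketched as follows. This is a minimal illustration, not the authors' released code; the function name, the seed parameter, and the toy dataset are assumptions made for the example.

```python
import random

def build_k_shot_split(train_examples, k=50, seed=0):
    """Randomly sample k examples from the original training set to
    form the k-shot training set Dtr (k = 50 by default, per the paper)."""
    if k > len(train_examples):
        raise ValueError("k exceeds the number of available training examples")
    rng = random.Random(seed)  # fixed seed so the sampled split is reproducible
    return rng.sample(train_examples, k)

# Illustrative usage with a toy dataset of (text, label) pairs.
full_train = [(f"example {i}", i % 2) for i in range(1000)]
d_tr = build_k_shot_split(full_train, k=50)
print(len(d_tr))  # 50
```

Fixing the random seed is one way to make such a sampled split reproducible; the paper itself does not state how (or whether) its sampling was seeded.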