Fairness-guided Few-shot Prompting for Large Language Models
Authors: Huan Ma, Changqing Zhang, Yatao Bian, Lemao Liu, Zhirui Zhang, Peilin Zhao, Shu Zhang, Huazhu Fu, Qinghua Hu, Bingzhe Wu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive experiments with state-of-the-art mainstream models such as GPT-3 on various downstream tasks. Our results indicate that our method can enhance the model's in-context learning performance in an effective and interpretable manner. |
| Researcher Affiliation | Collaboration | Huan Ma (1,2), Changqing Zhang (1), Yatao Bian (2), Lemao Liu (2), Zhirui Zhang (2), Peilin Zhao (2), Shu Zhang (2), Huazhu Fu (3), Qinghua Hu (1), Bingzhe Wu (2). (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) AI Lab, Tencent, Shenzhen, China; (3) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore |
| Pseudocode | Yes | Algorithm 1 (T-fair-Prompting) and Algorithm 2 (G-fair-Prompting) provide detailed pseudocode for the proposed strategies; a hedged Python sketch of both appears after the table. |
| Open Source Code | Yes | Code is available at: https://github.com/MaHuanAAA. |
| Open Datasets | Yes | We conducted experiments on various text classification datasets [21], namely SST-2, AGNews, CoLA, TREC, and RTE. [21] refers to: Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. |
| Dataset Splits | Yes | A naive strategy is to enumerate all candidates to find the prompt that achieves the best performance on the validation set... Specifically, in a scenario with four training samples (since enumerating all prompt cases for a larger number is prohibitively time-consuming), we enumerate all possible combinations and permutations of demonstrations for various datasets and LLMs (see the enumeration sketch after the table). |
| Hardware Specification | Yes | Hardware: BLOOM = A100, LLaMA = V100. |
| Software Dependencies | No | The paper mentions models such as GPT-3, BLOOM, and LLaMA, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | For each dataset, we set the size of the training set to 4. We conducted experiments under different settings and reported the results of five runs. To elucidate the impact of demonstration selection, we select four demonstrations for each different seed and randomly sample an order for each combination. The maximum input length of LLaMA is 512. |
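
To make the Pseudocode row concrete, here is a minimal Python sketch of the two selection strategies. It assumes the paper's fairness score can be approximated as closeness to uniform of the model's label distribution on a content-free probe input (e.g. "N/A"); the `predict` callback, the probe string, and the total-variation scoring are illustrative placeholders, not the authors' exact implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical callback: given a list of demonstrations and a query, return
# the LLM's probability distribution over the task's label set. Any backend
# (OpenAI API, local BLOOM/LLaMA, ...) can implement this contract.
PredictFn = Callable[[List[str], str], Sequence[float]]

CONTENT_FREE_QUERY = "N/A"  # content-free probe input (assumed form)


def fairness(demos: List[str], predict: PredictFn) -> float:
    """Score a prompt by how uniform its label distribution is on the
    content-free probe; higher means fairer (less predictive bias)."""
    probs = predict(demos, CONTENT_FREE_QUERY)
    uniform = 1.0 / len(probs)
    # Negated total variation distance from the uniform distribution.
    return -0.5 * sum(abs(p - uniform) for p in probs)


def t_fair_prompting(candidates: List[str], predict: PredictFn, k: int) -> List[str]:
    """T-fair-Prompting (sketch): keep the k individually fairest demos."""
    return sorted(candidates, key=lambda d: fairness([d], predict), reverse=True)[:k]


def g_fair_prompting(candidates: List[str], predict: PredictFn) -> List[str]:
    """G-fair-Prompting (sketch): greedily append whichever remaining demo
    makes the combined prompt fairest; stop once no demo improves it."""
    selected: List[str] = []
    best = fairness(selected, predict)
    remaining = list(candidates)
    while remaining:
        score, demo = max(
            ((fairness(selected + [d], predict), d) for d in remaining),
            key=lambda t: t[0],
        )
        if score <= best:
            break
        selected.append(demo)
        remaining.remove(demo)
        best = score
    return selected
```

In this sketch the trade-off between the two strategies is visible in the call counts: T-fair-Prompting scores each of the n demonstrations once, while G-fair-Prompting re-scores the growing prompt at every greedy step, costing O(n^2) model calls for a potentially better ordering.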
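The brute-force baseline quoted in the Dataset Splits row can be sketched in the same spirit. Whether the enumeration covers shorter prompts as well as full-length ones is our assumption, and `score_on_validation` is a hypothetical callback returning validation-set accuracy for a candidate prompt.

```python
from itertools import permutations
from typing import Callable, Iterable, Sequence, Tuple


def enumerate_prompts(train_pool: Sequence[str]) -> Iterable[Tuple[str, ...]]:
    """Yield every ordered subset of the training pool, i.e. all
    combinations of demonstrations in every order. For 4 samples this is
    4 + 12 + 24 + 24 = 64 candidate prompts."""
    for k in range(1, len(train_pool) + 1):
        yield from permutations(train_pool, k)


def best_prompt(
    train_pool: Sequence[str],
    score_on_validation: Callable[[Tuple[str, ...]], float],
) -> Tuple[str, ...]:
    """Brute-force baseline: return the arrangement with the highest
    validation score. Feasible only for tiny pools, which is why the
    paper restricts this comparison to four training samples."""
    return max(enumerate_prompts(train_pool), key=score_on_validation)
```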