Fairness-guided Few-shot Prompting for Large Language Models
Authors: Huan Ma, Changqing Zhang, Yatao Bian, Lemao Liu, Zhirui Zhang, Peilin Zhao, Shu Zhang, Huazhu Fu, Qinghua Hu, Bingzhe Wu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform comprehensive experiments with state-of-the-art mainstream models such as GPT-3 on various downstream tasks. Our results indicate that our method can enhance the model's in-context learning performance in an effective and interpretable manner. |
| Researcher Affiliation | Collaboration | Huan Ma (1,2), Changqing Zhang (1), Yatao Bian (2), Lemao Liu (2), Zhirui Zhang (2), Peilin Zhao (2), Shu Zhang (2), Huazhu Fu (3), Qinghua Hu (1), Bingzhe Wu (2). (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) AI Lab, Tencent, Shenzhen, China; (3) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore |
| Pseudocode | Yes | Algorithm 1 (T-fair-Prompting) and Algorithm 2 (G-fair-Prompting) provide detailed pseudocode for the proposed strategies; a hedged Python sketch of both appears after the table. |
| Open Source Code | Yes | Code is available at: https://github.com/MaHuanAAA. |
| Open Datasets | Yes | We conducted experiments on various text classification datasets [21], namely SST-2, AGNews, CoLA, TREC, and RTE. [21] refers to: Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. |
| Dataset Splits | Yes | A naive strategy is to enumerate all candidates to find the prompt that achieves the best performance on the validation set... Specifically, in a scenario with four training samples (since enumerating all prompt cases for a larger number is prohibitively time-consuming), we enumerate all possible combinations and permutations of demonstrations for various datasets and LLMs (see the enumeration sketch after the table). |
| Hardware Specification | Yes | Hardware: BLOOM = A100, LLaMA = V100. |
| Software Dependencies | No | The paper mentions models such as GPT-3, BLOOM, and LLaMA, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | For each dataset, we set the size of the training set to 4. We conducted experiments under different settings and reported the results of five runs. To elucidate the impact of demonstration selection, we select four demonstrations for each different seed and randomly sample an order for each combination. The maximum input length of LLaMA is 512. |
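
To make the Pseudocode row concrete, here is a minimal Python sketch of the two selection strategies. It assumes the paper's fairness score can be approximated as closeness to uniform of the model's label distribution on a content-free probe input (e.g. "N/A"); the `predict` callback, the probe string, and the total-variation scoring are illustrative placeholders, not the authors' exact implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical callback: given a list of demonstrations and a query, return
# the LLM's probability distribution over the task's label set. Any backend
# (OpenAI API, local BLOOM/LLaMA, ...) can implement this contract.
PredictFn = Callable[[List[str], str], Sequence[float]]

CONTENT_FREE_QUERY = "N/A"  # content-free probe input (assumed form)


def fairness(demos: List[str], predict: PredictFn) -> float:
    """Score a prompt by how uniform its label distribution is on the
    content-free probe; higher means fairer (less predictive bias)."""
    probs = predict(demos, CONTENT_FREE_QUERY)
    uniform = 1.0 / len(probs)
    # Negated total variation distance from the uniform distribution.
    return -0.5 * sum(abs(p - uniform) for p in probs)


def t_fair_prompting(candidates: List[str], predict: PredictFn, k: int) -> List[str]:
    """T-fair-Prompting (sketch): keep the k individually fairest demos."""
    return sorted(candidates, key=lambda d: fairness([d], predict), reverse=True)[:k]


def g_fair_prompting(candidates: List[str], predict: PredictFn) -> List[str]:
    """G-fair-Prompting (sketch): greedily append whichever remaining demo
    makes the combined prompt fairest; stop once no demo improves it."""
    selected: List[str] = []
    best = fairness(selected, predict)
    remaining = list(candidates)
    while remaining:
        score, demo = max(
            ((fairness(selected + [d], predict), d) for d in remaining),
            key=lambda t: t[0],
        )
        if score <= best:
            break
        selected.append(demo)
        remaining.remove(demo)
        best = score
    return selected
```

In this sketch the trade-off between the two strategies is visible in the call counts: T-fair-Prompting scores each of the n demonstrations once, while G-fair-Prompting re-scores the growing prompt at every greedy step, costing O(n^2) model calls for a potentially better ordering.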
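The brute-force baseline quoted in the Dataset Splits row can be sketched in the same spirit. Whether the enumeration covers shorter prompts as well as full-length ones is our assumption, and `score_on_validation` is a hypothetical callback returning validation-set accuracy for a candidate prompt.

```python
from itertools import permutations
from typing import Callable, Iterable, Sequence, Tuple


def enumerate_prompts(train_pool: Sequence[str]) -> Iterable[Tuple[str, ...]]:
    """Yield every ordered subset of the training pool, i.e. all
    combinations of demonstrations in every order. For 4 samples this is
    4 + 12 + 24 + 24 = 64 candidate prompts."""
    for k in range(1, len(train_pool) + 1):
        yield from permutations(train_pool, k)


def best_prompt(
    train_pool: Sequence[str],
    score_on_validation: Callable[[Tuple[str, ...]], float],
) -> Tuple[str, ...]:
    """Brute-force baseline: return the arrangement with the highest
    validation score. Feasible only for tiny pools, which is why the
    paper restricts this comparison to four training samples."""
    return max(enumerate_prompts(train_pool), key=score_on_validation)
```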