Customizing Language Model Responses with Contrastive In-Context Learning
Authors: Xiang Gao, Kamalika Das
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We tested our approach on both synthesized and real-world datasets, including Stack Exchange and Reddit, and found that it significantly improves performance compared to standard few-shot prompting. |
| Researcher Affiliation | Industry | Intuit AI Research, 2700 Coast Avenue, Mountain View, CA 94043 {xiang gao, kamalika das}@intuit.com |
| Pseudocode | No | The paper describes its method in text and uses Figure 2 to illustrate the prompting process but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about releasing the source code for the described methodology. |
| Open Datasets | Yes | We tested our approach on both synthesized and real-world datasets, including Stack Exchange and Reddit, and found that it significantly improves performance compared to standard few-shot prompting. Our focus is on subjective preferences, so we created a dataset using data from cooking.stackexchange.com, which includes how-to type cooking-related questions. We also experimented with a subreddit called No Stupid Questions. |
| Dataset Splits | No | We randomly selected 500 samples for evaluation. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used, such as GPU models, CPU types, or memory specifications. It only mentions the LLMs used (GPT-3, ChatGPT, GPT-4). |
| Software Dependencies | No | We consider two types of large language models (LLMs): non-conversational LLMs, such as GPT-3, and conversational LLMs, including ChatGPT (GPT-3.5-turbo) and GPT-4. We employ two metrics, BERTScore (Zhang et al. 2019) and Emb. Similarity. The latter is the cosine similarity of the sentence embeddings obtained using the Sentence-BERT model (Reimers and Gurevych 2019). We leverage DialogRPT (Gao et al. 2020), a pretrained dialog response ranking model. |
| Experiment Setup | Yes | In our experiments, we consider two types of large language models (LLMs): non-conversational LLMs, such as GPT-3, and conversational LLMs, including ChatGPT (GPT-3.5-turbo) and GPT-4. We evaluate these LLMs under two different settings: zero-shot and few-shot. For each query, we randomly select k labeled examples as few-shot examples. [...] We experimented with k = 1 to k = 4 and stopped when performance did not increase significantly as k increased. For the few-shot setting, we compare the standard approach, which only includes positive examples, with three settings involving contrastive examples: Contrastive Examples Only, Contrastive Instruction Only, Contrastive Examples + Instruction. |
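
As a rough illustration of the evaluation metrics quoted in the Software Dependencies row, the sketch below computes BERTScore and the embedding ("Emb.") similarity, i.e. the cosine similarity of Sentence-BERT sentence embeddings. The `all-MiniLM-L6-v2` checkpoint and the example sentences are assumptions; the paper cites Sentence-BERT (Reimers and Gurevych 2019) without naming a specific checkpoint, and the DialogRPT ranking model is not shown here.

```python
# Minimal sketch of the two automatic metrics mentioned in the paper:
# BERTScore (Zhang et al. 2019) and Sentence-BERT embedding similarity
# (Reimers and Gurevych 2019). Checkpoint choice is an assumption.
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

generated = ["Sear the steak in a hot pan, then let it rest for five minutes."]
reference = ["Cook the steak over high heat and rest it before slicing."]

# BERTScore: token-level similarity between candidate and reference.
_, _, f1 = bert_score(generated, reference, lang="en")
print("BERTScore F1:", f1.mean().item())

# Emb. similarity: cosine similarity of sentence embeddings.
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
emb_gen = sbert.encode(generated, convert_to_tensor=True)
emb_ref = sbert.encode(reference, convert_to_tensor=True)
print("Emb. similarity:", util.cos_sim(emb_gen, emb_ref).item())
```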
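
The Experiment Setup row contrasts standard few-shot prompting (positive examples only) with prompts that add contrastive examples and/or a contrastive instruction. The sketch below assembles a prompt in the "Contrastive Examples + Instruction" style; the instruction wording, example formatting, and the helper name `build_prompt` are illustrative assumptions, not the paper's exact prompt.

```python
# Minimal sketch of a "Contrastive Examples + Instruction" few-shot prompt.
# The phrasing and layout are assumptions; only the idea of pairing positive
# (desirable) with contrastive (undesirable) examples comes from the paper.

def build_prompt(query, positive_examples, contrastive_examples):
    """positive/contrastive_examples are lists of (question, answer) pairs."""
    lines = [
        "Answer the question in the style of the good examples below.",
        # Contrastive instruction: explicitly tell the model what to avoid.
        "The bad examples show undesirable answers; do not imitate them.",
        "",
        "Good examples:",
    ]
    for q, a in positive_examples:
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines.append("Bad examples:")
    for q, a in contrastive_examples:
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)

prompt = build_prompt(
    query="How do I keep fresh pasta from sticking together?",
    positive_examples=[("How do I stop rice from burning?",
                        "Use low heat and keep the lid on until the water is absorbed.")],
    contrastive_examples=[("How do I stop rice from burning?",
                           "Rice is a cereal grain cultivated worldwide.")],
)
print(prompt)
# The resulting prompt can then be sent as the user message to a
# conversational LLM such as GPT-3.5-turbo or GPT-4.
```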