Customizing Language Model Responses with Contrastive In-Context Learning

Authors: Xiang Gao, Kamalika Das

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We tested our approach on both synthesized and real-world datasets, including Stack Exchange and Reddit, and found that it significantly improves performance compared to standard few-shot prompting.
Researcher Affiliation | Industry | Intuit AI Research, 2700 Coast Avenue, Mountain View, CA 94043 {xiang gao, kamalika das}@intuit.com
Pseudocode | No | The paper describes its method in text and uses Figure 2 to illustrate the prompting process, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any links or explicit statements about releasing the source code for the described methodology.
Open Datasets | Yes | We tested our approach on both synthesized and real-world datasets, including Stack Exchange and Reddit, and found that it significantly improves performance compared to standard few-shot prompting. Our focus is on subjective preferences, so we created a dataset using data from cooking.stackexchange.com, which includes how-to type cooking-related questions. We also experimented with a subreddit called No Stupid Questions.
Dataset Splits | No | We randomly selected 500 samples for evaluation.
Hardware Specification | No | The paper does not provide specific details on the hardware used, such as GPU models, CPU types, or memory specifications. It only mentions the LLMs used (GPT-3, ChatGPT, GPT-4).
Software Dependencies | No | We consider two types of large language models (LLMs): non-conversational LLMs, such as GPT-3, and conversational LLMs, including ChatGPT (GPT-3.5-turbo) and GPT-4. We employ two metrics, BERTScore (Zhang et al. 2019) and Emb. Similarity; the latter is the cosine similarity of sentence embeddings obtained with the Sentence-BERT model (Reimers and Gurevych 2019). We also leverage DialogRPT (Gao et al. 2020), a pretrained dialog response ranking model. (A metric sketch appears after the table.)
Experiment Setup | Yes | In our experiments, we consider two types of large language models (LLMs): non-conversational LLMs, such as GPT-3, and conversational LLMs, including ChatGPT (GPT-3.5-turbo) and GPT-4. We evaluate these LLMs under two settings: zero-shot and few-shot. For each query, we randomly select k labeled examples as few-shot examples. [...] We experimented with k = 1 to k = 4 and stopped when performance did not increase significantly as k increased. For the few-shot setting, we compare the standard approach, which only includes positive examples, with three settings involving contrastive examples: Contrastive Examples Only, Contrastive Instruction Only, and Contrastive Examples + Instruction. (A prompt-construction sketch appears after the table.)
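
To make the three contrastive few-shot settings concrete, here is a minimal prompt-construction sketch. The template wording, function name, and example data are illustrative assumptions, not the paper's verbatim prompts (which are shown only in its Figure 2); the resulting string can be sent as a completion prompt to a non-conversational LLM or wrapped in a single user message for a chat model.

```python
def build_contrastive_prompt(query, positive_examples, negative_examples,
                             use_contrastive_examples=True,
                             use_contrastive_instruction=True):
    """Assemble a few-shot prompt in the spirit of the paper's settings:
    Contrastive Examples Only, Contrastive Instruction Only, or both.
    All wording is illustrative, not the authors' exact template."""
    parts = []
    if use_contrastive_instruction:
        # "Contrastive Instruction": explicitly tell the model to imitate the
        # desirable examples and avoid the style of the undesirable ones.
        parts.append("Answer the final question in the style of the desirable "
                     "examples and avoid the style of the undesirable examples.")
    for q, a in positive_examples:
        parts.append(f"Question: {q}\nDesirable answer: {a}")
    if use_contrastive_examples:
        # "Contrastive Examples": also show answers the user does NOT want.
        for q, a in negative_examples:
            parts.append(f"Question: {q}\nUndesirable answer: {a}")
    parts.append(f"Question: {query}\nDesirable answer:")
    return "\n\n".join(parts)


# Example usage with hypothetical cooking-style data (k = 1).
prompt = build_contrastive_prompt(
    query="How do I keep rice from sticking to the pot?",
    positive_examples=[("How do I thicken a sauce?",
                        "Simmer it uncovered, or whisk in a cornstarch slurry.")],
    negative_examples=[("How do I thicken a sauce?",
                        "Sauce thickening is governed by rheology, the study of "
                        "how matter flows under applied stress...")],
)
print(prompt)
```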
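
The automatic metrics named under Software Dependencies can be reproduced with off-the-shelf packages. Below is a minimal sketch assuming the `bert-score` and `sentence-transformers` libraries and the `all-MiniLM-L6-v2` checkpoint; the paper does not state which Sentence-BERT checkpoint it used, and the DialogRPT ranking step is omitted here since the specific DialogRPT variant is likewise unspecified.

```python
from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util

candidates = ["Rinse the rice before cooking and use a lower simmer."]
references = ["Wash the rice first and cook it on low heat."]

# BERTScore (Zhang et al. 2019): token-level similarity against the reference.
P, R, F1 = bertscore(candidates, references, lang="en")
print("BERTScore F1:", F1.mean().item())

# Emb. Similarity: cosine similarity of Sentence-BERT embeddings
# (checkpoint choice here is an assumption, not stated in the paper).
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(candidates + references, convert_to_tensor=True)
print("Embedding cosine similarity:", util.cos_sim(emb[0], emb[1]).item())
```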