DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines
Authors: Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct two case studies, showing that succinct DSPy programs can express and optimize pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. |
| Researcher Affiliation | Collaboration | 1Stanford University, 2UC Berkeley, 3Carnegie Mellon University, 4Amazon Alexa AI, 5Dashworks Technologies, Inc., 6IIT Bombay, 7Calera Capital, 8Microsoft Qatar, 9Two Sigma Investments |
| Pseudocode | Yes | H.1 BOOTSTRAPFEWSHOT: `class SimplifiedBootstrapFewShot(Teleprompter): ...` H.2 BOOTSTRAPFEWSHOTWITHRANDOMSEARCH: `class SimplifiedBootstrapFewShotWithRandomSearch(Teleprompter): ...` |
| Open Source Code | Yes | DSPy is available at https://github.com/stanfordnlp/dspy. |
| Open Datasets | Yes | We evaluate on the popular GSM8K dataset with grade school math questions (Cobbe et al., 2021). We sample 200 and 300 question-answer pairs from the official training set for training and development, respectively. ... In this case study, we explore the multi-hop question answering task with the HotPotQA (Yang et al., 2018) dataset in the open-domain "fullwiki" setting. |
| Dataset Splits | Yes | We sample 200 and 300 question-answer pairs from the official training set for training and development, respectively. Our final evaluations use the 1.3k official test set examples. We report extensive comparisons on the development set to avoid overfitting on the test set. ... We sub-divide the training set into 70%/30% train/validation splits. |
| Hardware Specification | Yes | DSPy experiments specifically relied on NVIDIA A100-SXM GPUs with 80 GiB of memory. |
| Software Dependencies | Yes | Python 3.9 or higher; datasets 2.14.5; transformers 4.32.0. |
| Experiment Setup | Yes | We compile using the BootstrapFewShotWithRandomSearch teleprompter with 7 candidate programs. DSPy supports parallel evaluation. In these runs, we set the maximum number of parallel threads to 10. |
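The Appendix H pseudocode quoted above can be fleshed out as a minimal, runnable sketch. All names below (`simplified_bootstrap_few_shot`, `bootstrap_random_search`, the toy `make_program` and `metric`) are hypothetical illustrations of the bootstrapping idea, not DSPy's actual API; the real teleprompters trace LM calls inside a DSPy program rather than matching answers in a dictionary.

```python
import random

def simplified_bootstrap_few_shot(program, trainset, metric, max_demos=4):
    """Run the program on training examples and keep the (example, prediction)
    pairs whose output passes the metric as demonstrations (cf. Appendix H.1)."""
    demos = []
    for example in trainset:
        prediction = program(example)
        if metric(example, prediction):
            demos.append((example, prediction))
            if len(demos) >= max_demos:
                break
    return demos

def bootstrap_random_search(make_program, trainset, valset, metric,
                            num_candidates=7, seed=0):
    """Bootstrap num_candidates demo sets over shuffled training data and
    keep the one scoring highest on the validation set (cf. Appendix H.2)."""
    rng = random.Random(seed)
    best_score, best_demos = -1.0, []
    for _ in range(num_candidates):
        shuffled = list(trainset)
        rng.shuffle(shuffled)
        demos = simplified_bootstrap_few_shot(make_program([]), shuffled, metric)
        candidate = make_program(demos)
        score = sum(metric(ex, candidate(ex)) for ex in valset) / len(valset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_score, best_demos

# Toy stand-in for an LM-backed program: it answers from a fixed base of
# knowledge plus whatever demonstrations it was compiled with.
BASE = {"2+2": "4", "3+3": "6"}

def make_program(demos):
    known = dict(BASE)
    known.update({ex["q"]: pred for ex, pred in demos})
    return lambda ex: known.get(ex["q"], "?")

metric = lambda ex, pred: pred == ex["a"]
trainset = [{"q": "2+2", "a": "4"}, {"q": "5+5", "a": "10"}, {"q": "3+3", "a": "6"}]
valset = [{"q": "2+2", "a": "4"}, {"q": "3+3", "a": "6"}]
best_score, best_demos = bootstrap_random_search(make_program, trainset, valset, metric)
```

The `num_candidates=7` default mirrors the 7 candidate programs in the compile setting quoted above; per the table, the actual runs also evaluated candidates with up to 10 parallel threads, which this sequential sketch omits.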