DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines
Authors: Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct two case studies, showing that succinct DSPy programs can express and optimize pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, DSPy can automatically produce pipelines that outperform out-of-the-box few-shot prompting as well as expert-created demonstrations for GPT-3.5 and Llama2-13b-chat. |
| Researcher Affiliation | Collaboration | 1Stanford University, 2UC Berkeley, 3Carnegie Mellon University, 4Amazon Alexa AI, 5Dashworks Technologies, Inc., 6IIT Bombay, 7Calera Capital, 8Microsoft Qatar, 9Two Sigma Investments |
| Pseudocode | Yes | H.1 BOOTSTRAPFEWSHOT: `class SimplifiedBootstrapFewShot(Teleprompter): ...` H.2 BOOTSTRAPFEWSHOTWITHRANDOMSEARCH: `class SimplifiedBootstrapFewShotWithRandomSearch(Teleprompter): ...` |
| Open Source Code | Yes | DSPy is available at https://github.com/stanfordnlp/dspy. |
| Open Datasets | Yes | We evaluate on the popular GSM8K dataset with grade school math questions (Cobbe et al., 2021). We sample 200 and 300 question-answer pairs from the official training set for training and development, respectively. ... In this case study, we explore the multi-hop question answering task with the HotPotQA (Yang et al., 2018) dataset in the open-domain "fullwiki" setting. |
| Dataset Splits | Yes | We sample 200 and 300 question-answer pairs from the official training set for training and development, respectively. Our final evaluations use the 1.3k official test set examples. We report extensive comparisons on the development set to avoid overfitting on the test set. ... We sub-divide the training set into 70%/30% train/validation splits. |
| Hardware Specification | Yes | DSPy experiments specifically relied on NVIDIA A100-SXM GPUs with 80 GiB of memory. |
| Software Dependencies | Yes | Python 3.9 or higher; datasets 2.14.5; transformers 4.32.0. |
| Experiment Setup | Yes | We compile using the BootstrapFewShotWithRandomSearch teleprompter with 7 candidate programs. DSPy supports parallel evaluation. In these runs, we set the maximum number of parallel threads to 10. |
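The Appendix H pseudocode quoted above can be fleshed out as a minimal, runnable sketch. All names below (`simplified_bootstrap_few_shot`, `bootstrap_random_search`, the toy `make_program` and `metric`) are hypothetical illustrations of the bootstrapping idea, not DSPy's actual API; the real teleprompters trace LM calls inside a DSPy program rather than matching answers in a dictionary.

```python
import random

def simplified_bootstrap_few_shot(program, trainset, metric, max_demos=4):
    """Run the program on training examples and keep the (example, prediction)
    pairs whose output passes the metric as demonstrations (cf. Appendix H.1)."""
    demos = []
    for example in trainset:
        prediction = program(example)
        if metric(example, prediction):
            demos.append((example, prediction))
            if len(demos) >= max_demos:
                break
    return demos

def bootstrap_random_search(make_program, trainset, valset, metric,
                            num_candidates=7, seed=0):
    """Bootstrap num_candidates demo sets over shuffled training data and
    keep the one scoring highest on the validation set (cf. Appendix H.2)."""
    rng = random.Random(seed)
    best_score, best_demos = -1.0, []
    for _ in range(num_candidates):
        shuffled = list(trainset)
        rng.shuffle(shuffled)
        demos = simplified_bootstrap_few_shot(make_program([]), shuffled, metric)
        candidate = make_program(demos)
        score = sum(metric(ex, candidate(ex)) for ex in valset) / len(valset)
        if score > best_score:
            best_score, best_demos = score, demos
    return best_score, best_demos

# Toy stand-in for an LM-backed program: it answers from a fixed base of
# knowledge plus whatever demonstrations it was compiled with.
BASE = {"2+2": "4", "3+3": "6"}

def make_program(demos):
    known = dict(BASE)
    known.update({ex["q"]: pred for ex, pred in demos})
    return lambda ex: known.get(ex["q"], "?")

metric = lambda ex, pred: pred == ex["a"]
trainset = [{"q": "2+2", "a": "4"}, {"q": "5+5", "a": "10"}, {"q": "3+3", "a": "6"}]
valset = [{"q": "2+2", "a": "4"}, {"q": "3+3", "a": "6"}]
best_score, best_demos = bootstrap_random_search(make_program, trainset, valset, metric)
```

The `num_candidates=7` default mirrors the 7 candidate programs in the compile setting quoted above; per the table, the actual runs also evaluated candidates with up to 10 parallel threads, which this sequential sketch omits.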