Joint Prompt Optimization of Stacked LLMs using Variational Inference
Authors: Alessandro Sordoni, Eric Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, Nicolas Le Roux
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful. |
| Researcher Affiliation | Collaboration | Microsoft Research Montréal; MILA |
| Pseudocode | Yes | Algorithm 1 One-Layer Language Network (DLN-1) Training Algorithm; Algorithm 2 Two-Layer Deep Language Network (DLN-2) Training Algorithm; Algorithm 3 Deep Language Network Training Algorithm |
| Open Source Code | Yes | The DLN code is open source.1 ... 1Code: https://github.com/microsoft/deep-language-networks. |
| Open Datasets | Yes | We adopt a set of nine NLP and reasoning tasks commonly used in prior work studying zero- or few-shot learning capabilities of LLMs [23, 10, 39, 42, 1]. ... Table 3: Tasks used in this work (columns: |train|, |valid|, |test|, |class|, Description). |
| Dataset Splits | Yes | For tasks adopted from Big Bench-Hard (BBH) [42] (Hyper., Nav., Date. and Logic.7), we use the 250 data points provided by BBH as the test set. We take the remaining data points from Big Bench [39] that were not included in BBH, and randomly split them (evenly) into training and validation sets. ... For tasks adopted from Leopard [1] (Disaster and Airline), we randomly sample 400, 250, and 250 data points as the training, validation, and test sets. |
| Hardware Specification | No | The paper states, "Throughout this paper, we use OpenAI's models, specifically GPT-3 (text-davinci-003) and GPT-4, as the backbone to our proposed systems," implying the use of OpenAI's API. It does not specify the underlying hardware (e.g., GPU models, CPU types) used by OpenAI or any local experimental hardware. |
| Software Dependencies | No | The paper mentions using "OpenAI's models, specifically GPT-3 (text-davinci-003) and GPT-4," but it does not specify any software libraries, frameworks (like PyTorch or TensorFlow), or other software components with version numbers. |
| Experiment Setup | Yes | For DLNs, we use a batch size of 20 and train for 20 iterations by early-stopping on validation performance evaluated every 2 iterations. We sample N = 20 prompt proposals and K = 5 hidden samples. ... We report hyperparameter search space in Table 10. ... We use bh_tpl = "v3.5", tolerance = 2, use_memory = 2, held_out_prompt_ranking = True, logp_penalty = 0.5. |
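The Leopard split described in the Dataset Splits row (400 train / 250 validation / 250 test, sampled at random) can be sketched as follows. This is a minimal illustration; the function name, seed, and example data are hypothetical and not taken from the DLN codebase.

```python
import random

def split_leopard_task(examples, seed=42):
    """Randomly partition a task's examples into 400 train,
    250 validation, and 250 test points, as described for the
    Leopard tasks (Disaster and Airline)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    train = shuffled[:400]
    valid = shuffled[400:650]
    test = shuffled[650:900]
    return train, valid, test

# Illustrative dummy data: 1000 labeled examples.
data = [{"id": i} for i in range(1000)]
train, valid, test = split_leopard_task(data)
print(len(train), len(valid), len(test))  # 400 250 250
```

A fixed seed keeps the split reproducible across runs, which is the property the review is checking for.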
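The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration, which is how they would typically be passed to a training script. The dictionary below is a hypothetical sketch; the key names are illustrative (only `bh_tpl`, `tolerance`, `use_memory`, `held_out_prompt_ranking`, and `logp_penalty` are verbatim from the paper).

```python
# Hypothetical DLN training configuration mirroring the reported settings.
dln_config = {
    "batch_size": 20,
    "iterations": 20,
    "eval_every": 2,              # early-stop on validation every 2 iterations
    "num_prompt_proposals": 20,   # N = 20 prompt proposals
    "num_hidden_samples": 5,      # K = 5 hidden samples
    "bh_tpl": "v3.5",
    "tolerance": 2,
    "use_memory": 2,
    "held_out_prompt_ranking": True,
    "logp_penalty": 0.5,
}
```

Recording these values in one place is what makes the Experiment Setup criterion pass: a reader can reconstruct the run without searching the paper's appendix.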