Joint Prompt Optimization of Stacked LLMs using Variational Inference

Authors: Alessandro Sordoni, Eric Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, Nicolas Le Roux

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.
Researcher Affiliation | Collaboration | Microsoft Research Montréal; MILA
Pseudocode | Yes | Algorithm 1 One-Layer Language Network (DLN-1) Training Algorithm; Algorithm 2 Two-Layer Deep Language Network (DLN-2) Training Algorithm; Algorithm 3 Deep Language Network Training Algorithm
Open Source Code | Yes | The DLN code is open source. ... Code: https://github.com/microsoft/deep-language-networks.
Open Datasets | Yes | We adopt a set of nine NLP and reasoning tasks commonly used in prior work studying zero- or few-shot learning capabilities of LLMs [23, 10, 39, 42, 1]. ... Table 3: Tasks used in this work. |train| |valid| |test| |class| Description
Dataset Splits | Yes | For tasks adopted from Big Bench-Hard (BBH) [42] (Hyper., Nav., Date. and Logic.7), we use the 250 data points provided by BBH as the test set. We take the remaining data points from Big Bench [39] that were not included in BBH, and randomly split them (evenly) into training and validation sets. ... For tasks adopted from Leopard [1] (Disaster and Airline), we randomly sample 400, 250, and 250 data points as training, validation, and test sets.
Hardware Specification | No | The paper states, "Throughout this paper, we use OpenAI's models, specifically GPT-3 (text-davinci-003) and GPT-4, as the backbone to our proposed systems," implying the use of OpenAI's API. It does not specify the underlying hardware (e.g., GPU models, CPU types) used by OpenAI or any local experimental hardware.
Software Dependencies | No | The paper mentions using "OpenAI's models, specifically GPT-3 (text-davinci-003) and GPT-4," but it does not specify any software libraries, frameworks (like PyTorch or TensorFlow), or other software components with version numbers.
Experiment Setup | Yes | For DLNs, we use a batch size of 20 and train for 20 iterations by early-stopping on validation performance evaluated every 2 iterations. We sample N = 20 prompt proposals and K = 5 hidden samples. ... We report the hyperparameter search space in Table 10. ... We use bh_tpl = "v3.5", tolerance = 2, use_memory = 2, held_out_prompt_ranking = True, logp_penalty = 0.5.
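
The Research Type row above quotes the paper's description of stacking a second prompted LLM "layer" (DLN-2) on top of a first one (DLN-1). Below are a few hedged, illustrative sketches of the mechanisms quoted in the table; all helper names are assumptions, not the paper's released code. First, a minimal sketch of what a two-layer forward pass could look like, assuming a generic `call_llm` callable and placeholder prompt templates (the exact wiring, e.g. whether layer 2 re-reads the original input, is a design choice made in the paper, not dictated by this sketch):

```python
from typing import Callable

def dln2_forward(x: str, prompt1: str, prompt2: str,
                 call_llm: Callable[[str], str]) -> str:
    """Sketch of a two-layer language network forward pass.

    Layer 1 produces an intermediate text h from the input; layer 2 conditions
    on the input and h to produce the final answer. The prompt templates here
    are placeholders, not the paper's.
    """
    h = call_llm(f"{prompt1}\n\nInput: {x}\nAnswer:")                     # hidden text
    y = call_llm(f"{prompt2}\n\nInput: {x}\nIntermediate: {h}\nAnswer:")  # prediction
    return y
```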
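
The Pseudocode row lists Algorithm 1, the DLN-1 training loop. A schematic best-of-N version of that idea (propose prompt rewrites, score them on a minibatch, keep the best), where `propose_prompts` and `score_prompt` are assumed LLM-backed helpers rather than the repository's API:

```python
import random

def train_dln1(train_data, init_prompt, propose_prompts, score_prompt,
               iterations=20, batch_size=20, num_proposals=20):
    """Schematic DLN-1-style loop: sample prompt proposals and keep the best.

    propose_prompts(prompt, batch, n) -> list of candidate prompt strings
    score_prompt(prompt, batch)      -> float (e.g., accuracy or log-likelihood)
    Both helpers are placeholders for LLM-backed operators.
    """
    prompt = init_prompt
    for _ in range(iterations):
        batch = random.sample(train_data, min(batch_size, len(train_data)))
        candidates = [prompt] + propose_prompts(prompt, batch, num_proposals)
        # Keep the candidate that scores best on the current minibatch.
        prompt = max(candidates, key=lambda p: score_prompt(p, batch))
    return prompt
```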
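
The Dataset Splits row describes two split conventions. A small sketch of both, assuming plain Python lists of examples and an illustrative random seed (the paper does not specify one):

```python
import random

def split_bbh_task(bbh_points, big_bench_remaining, seed=0):
    """BBH tasks: the 250 BBH points form the test set; the remaining Big Bench
    points are shuffled and split evenly into train/valid."""
    rest = list(big_bench_remaining)
    random.Random(seed).shuffle(rest)
    half = len(rest) // 2
    return rest[:half], rest[half:], list(bbh_points)  # train, valid, test

def split_leopard_task(points, seed=0):
    """Leopard tasks (Disaster, Airline): sample 400/250/250 as train/valid/test."""
    pts = list(points)
    random.Random(seed).shuffle(pts)
    return pts[:400], pts[400:650], pts[650:900]
```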
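
The Experiment Setup row collects the main hyperparameters in one place. Here they are again as a compact configuration dictionary; key names are ours except where the quote itself names them (bh_tpl, tolerance, use_memory, held_out_prompt_ranking, logp_penalty):

```python
# Hyperparameters quoted in the Experiment Setup row above.
dln_config = {
    "batch_size": 20,
    "iterations": 20,
    "eval_every": 2,              # early-stop on validation every 2 iterations
    "num_prompt_proposals": 20,   # N
    "num_hidden_samples": 5,      # K
    "bh_tpl": "v3.5",
    "tolerance": 2,
    "use_memory": 2,
    "held_out_prompt_ranking": True,
    "logp_penalty": 0.5,
}
```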