Joint Prompt Optimization of Stacked LLMs using Variational Inference
Authors: Alessandro Sordoni, Eric Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, Nicolas Le Roux
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful. |
| Researcher Affiliation | Collaboration | Microsoft Research Montréal; MILA |
| Pseudocode | Yes | Algorithm 1 One-Layer Language Network (DLN-1) Training Algorithm; Algorithm 2 Two-Layer Deep Language Network (DLN-2) Training Algorithm; Algorithm 3 Deep Language Network Training Algorithm |
| Open Source Code | Yes | The DLN code is open source.1 ... 1Code: https://github.com/microsoft/deep-language-networks. |
| Open Datasets | Yes | We adopt a set of nine NLP and reasoning tasks commonly used in prior work studying zero- or few-shot learning capabilities of LLMs [23, 10, 39, 42, 1]. ... Table 3: Tasks used in this work (columns: |train|, |valid|, |test|, |class|, Description). |
| Dataset Splits | Yes | For tasks adopted from Big Bench-Hard (BBH) [42] (Hyper., Nav., Date. and Logic.7), we use the 250 data points provided by BBH as the test set. We take the remaining data points from Big Bench [39] that were not included in BBH, and randomly split them (evenly) into training and validation sets. ... For tasks adopted from Leopard [1] (Disaster and Airline), we randomly sample 400, 250, and 250 data points as the training, validation, and test sets. |
| Hardware Specification | No | The paper states, "Throughout this paper, we use OpenAI's models, specifically GPT-3 (text-davinci-003) and GPT-4, as the backbone to our proposed systems," implying the use of OpenAI's API. It does not specify the underlying hardware (e.g., GPU models, CPU types) used by OpenAI or any local experimental hardware. |
| Software Dependencies | No | The paper mentions using "OpenAI's models, specifically GPT-3 (text-davinci-003) and GPT-4," but it does not specify any software libraries, frameworks (like PyTorch or TensorFlow), or other software components with version numbers. |
| Experiment Setup | Yes | For DLNs, we use a batch size of 20 and train for 20 iterations by early-stopping on validation performance evaluated every 2 iterations. We sample N = 20 prompt proposals and K = 5 hidden samples. ... We report hyperparameter search space in Table 10. ... We use bh_tpl = "v3.5", tolerance = 2, use_memory = 2, held_out_prompt_ranking = True, logp_penalty = 0.5. |
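The Leopard split described in the Dataset Splits row (400 train / 250 validation / 250 test, sampled at random) can be sketched as follows. This is a minimal illustration; the function name, seed, and example data are hypothetical and not taken from the DLN codebase.

```python
import random

def split_leopard_task(examples, seed=42):
    """Randomly partition a task's examples into 400 train,
    250 validation, and 250 test points, as described for the
    Leopard tasks (Disaster and Airline)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    train = shuffled[:400]
    valid = shuffled[400:650]
    test = shuffled[650:900]
    return train, valid, test

# Illustrative dummy data: 1000 labeled examples.
data = [{"id": i} for i in range(1000)]
train, valid, test = split_leopard_task(data)
print(len(train), len(valid), len(test))  # 400 250 250
```

A fixed seed keeps the split reproducible across runs, which is the property the review is checking for.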
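The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration, which is how they would typically be passed to a training script. The dictionary below is a hypothetical sketch; the key names are illustrative (only `bh_tpl`, `tolerance`, `use_memory`, `held_out_prompt_ranking`, and `logp_penalty` are verbatim from the paper).

```python
# Hypothetical DLN training configuration mirroring the reported settings.
dln_config = {
    "batch_size": 20,
    "iterations": 20,
    "eval_every": 2,              # early-stop on validation every 2 iterations
    "num_prompt_proposals": 20,   # N = 20 prompt proposals
    "num_hidden_samples": 5,      # K = 5 hidden samples
    "bh_tpl": "v3.5",
    "tolerance": 2,
    "use_memory": 2,
    "held_out_prompt_ranking": True,
    "logp_penalty": 0.5,
}
```

Recording these values in one place is what makes the Experiment Setup criterion pass: a reader can reconstruct the run without searching the paper's appendix.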