Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
BatchPrompt: Accomplish more with less
Authors: Jianzhe Lin, Maurice Diesendruck, Liang Du, Robin Abraham
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experimental evaluation demonstrates that BPE + SEAS can boost the performance of BatchPrompt by a striking margin on a range of popular NLP tasks, including question answering (Boolq), textual entailment (RTE), and duplicate questions identification (QQP). |
| Researcher Affiliation | Industry | Jianzhe Lin, Maurice Diesendruck, Liang Du, Robin Abraham Microsoft |
| Pseudocode | Yes | The method is described using the pseudo-code in Alg. 1. |
| Open Source Code | Yes | Code: github.com/microsoft/BatchPrompt |
| Open Datasets | Yes | Boolq: Boolean Questions (Boolq) is a question-answering dataset for yes/no questions containing 15942 examples (9427 for training, 3270 for validation, 3245 for testing). |
| Dataset Splits | Yes | Boolq: Boolean Questions (Boolq) is a question-answering dataset for yes/no questions containing 15942 examples (9427 for training, 3270 for validation, 3245 for testing). |
| Hardware Specification | No | The paper mentions using "gpt-3.5-turbo and GPT-4" for evaluation but does not provide specific hardware details such as GPU models, CPU specifications, or memory. |
| Software Dependencies | No | The paper mentions using "gpt-3.5-turbo and GPT-4" but does not specify their version numbers or any other software dependencies (e.g., Python, libraries) with version numbers. |
| Experiment Setup | Yes | We use 2, 4, and 4 few shot examples for RTE, QQP, Boolq respectively... Temperature is always set to 0 for consistent results... The batch sizes we use for RTE, QQP, Boolq are 16/32/64/160... for GPT-4... and 16/32 for gpt-3.5-turbo... The number of voting rounds we choose is 1, 3, 5, 7, and 9. |
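
The Experiment Setup row mentions batch sizes and an odd number of voting rounds. As a rough illustration of how per-item answers might be combined across rounds, here is a minimal Python sketch. It is not the authors' implementation: `vote_over_rounds`, `majority_vote`, and `toy_classifier` are hypothetical names, the per-round shuffling is an assumption not stated on this page, and the toy classifier stands in for a real call to gpt-3.5-turbo or GPT-4.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence

# Values quoted in the Experiment Setup row above; elided details are left out.
FEW_SHOT = {"RTE": 2, "QQP": 4, "Boolq": 4}
VOTING_ROUNDS = [1, 3, 5, 7, 9]
TEMPERATURE = 0


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def vote_over_rounds(
    items: Sequence[str],
    classify_batch: Callable[[List[str]], List[str]],
    batch_size: int,
    rounds: int,
    seed: int = 0,
) -> List[str]:
    """Classify items in batches for several rounds, then vote per item."""
    rng = random.Random(seed)
    votes: List[List[str]] = [[] for _ in items]
    for _ in range(rounds):
        order = list(range(len(items)))
        rng.shuffle(order)  # assumption: items are re-ordered in each round
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            preds = classify_batch([items[i] for i in idx])
            for i, pred in zip(idx, preds):
                votes[i].append(pred)
    return [majority_vote(v) for v in votes]


def toy_classifier(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for an LLM call: 'yes' for even-length inputs."""
    return ["yes" if len(text) % 2 == 0 else "no" for text in batch]
```

A call such as `vote_over_rounds(questions, toy_classifier, batch_size=16, rounds=5)` returns one voted label per input. Using an odd number of rounds, as in the row above, avoids ties for binary labels.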