Order-Independence Without Fine Tuning
Authors: Reid McIlroy-Young, Katrina Brown, Conlan Olson, Linjun Zhang, Cynthia Dwork
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we present Set-Based Prompting, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences. We show that this method provably eliminates order dependency, and that it can be applied to any transformer-based LLM to enable text generation that is unaffected by re-orderings. Delving into the implications of our method, we show that, despite our inputs being out of distribution, the impact on expected accuracy is small, where the expectation is over the order of uniformly chosen shuffling of the candidate responses, and usually significantly less in practice. Thus, Set-Based Prompting can be used as a dropped-in method on fully trained models. Finally, we discuss how our method's success suggests that other strong guarantees can be obtained on LLM performance via modifying the input representations. (A sketch of the input-representation idea appears below the table.) |
| Researcher Affiliation | Academia | Reid McIlroy-Young, Department of Computer Science, Harvard University; Katrina Brown, Department of Computer Science, Harvard University; Conlan Olson, Department of Computer Science, Columbia University; Linjun Zhang, Department of Statistics, Rutgers University; Cynthia Dwork, Department of Computer Science, Harvard University |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Code is available at github.com/reidmcy/set-based-prompting. Full code for our experiments is available as a part of the final paper release. The code is released under the MIT license. The code can be found at this URL: https://github.com/reidmcy/set-based-prompting. |
| Open Datasets | Yes | For testing we used two standard test sets: Commonsense QA (CSQA) (Talmor et al., 2019) and Measuring Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020); these were chosen because they are multiple-choice question sets that allow us to use the multiple options as the parallel sub-sequences. |
| Dataset Splits | No | The paper mentions using 'standard test sets' and '1-shot evaluation' but does not explicitly state specific training, validation, or test split percentages, sample counts, or specific citations for how these datasets were split for their experiments. |
| Hardware Specification | Yes | Most of our results were done on a single Nvidia A100 GPU with 80GB of memory, running one model, thus the lack of 70B model results. |
| Software Dependencies | No | The paper mentions using 'huggingface models' and different LLM families (GPT-2, Llama 2, Llama 3, Mistral), and 'greedy decoding with temperature 0', but it does not specify version numbers for general software dependencies such as Python, PyTorch, TensorFlow, or other specific libraries. |
| Experiment Setup | Yes | We implement the normal 1-shot evaluation for these tests, which looks at the probability of each answer being generated, and we use the greatest value as the selected answer. We perform greedy decoding with temperature 0 and apply zero-shot prompting in all experiments. Instead of using numbers or letters before each option, we quote the options in double quotes (") and separate them with a space. (A sketch of this scoring procedure follows the table.) |
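
The abstract above says order dependence is eliminated "via modifying the input representations" of a specified set of sub-sequences. The snippet below is a minimal sketch of how such a representation can be built, under the assumption (my reading of the method, not code from the paper or its repository) that the parallel sub-sequences receive identical position indices and are masked so they cannot attend to one another; the token ids and variable names are purely illustrative.

```python
import torch

# Illustrative token ids: a shared prefix followed by two parallel sub-sequences
# (e.g. two candidate answers). These values are placeholders, not real vocabulary ids.
prefix = [10, 11, 12]
parallel = [[20, 21], [30, 31, 32]]

tokens = list(prefix)
position_ids = list(range(len(prefix)))
segment = [-1] * len(prefix)          # -1 marks the shared prefix

for seg_id, sub in enumerate(parallel):
    tokens += sub
    # Each parallel sub-sequence restarts its positions right after the prefix,
    # so swapping the sub-sequences leaves every token's position index unchanged.
    position_ids += list(range(len(prefix), len(prefix) + len(sub)))
    segment += [seg_id] * len(sub)

n = len(tokens)
attn_mask = torch.zeros(n, n, dtype=torch.bool)
for q in range(n):
    for k in range(q + 1):            # causal attention over the flattened sequence
        # A query token may attend to the prefix and to its own sub-sequence,
        # but never to a different parallel sub-sequence.
        attn_mask[q, k] = segment[k] == -1 or segment[k] == segment[q]

print(position_ids)     # [0, 1, 2, 3, 4, 3, 4, 5]
print(attn_mask.int())  # block structure: cross-sub-sequence attention is masked out
```

Because the position indices and the attention mask built this way do not change when the parallel sub-sequences are swapped, and self-attention is permutation-invariant given positions and mask, a transformer fed this representation cannot produce output that depends on the order in which the sub-sequences are listed.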
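
The Experiment Setup row describes picking the answer whose text the model assigns the greatest probability of generating, with greedy (temperature 0) decoding and the options wrapped in double quotes separated by spaces. Below is a minimal sketch of that scoring loop using the HuggingFace transformers API; the model name, example question, and the helper `option_logprob` are illustrative assumptions rather than the paper's released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder; the paper evaluates GPT-2, Llama 2/3, and Mistral models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Total log-probability the model assigns to `option` when it follows `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(" " + option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for j in range(option_ids.shape[1]):
        pos = prompt_ids.shape[1] + j                  # index of the j-th option token
        tok = input_ids[0, pos]
        total += log_probs[0, pos - 1, tok].item()     # logits at pos-1 predict the token at pos
    return total

# Options are wrapped in double quotes and separated by spaces, per the setup above.
options = ["Paris", "London", "Rome"]
prompt = 'Question: What is the capital of France? Options: "Paris" "London" "Rome"\nAnswer:'
scores = {opt: option_logprob(prompt, opt) for opt in options}
print(max(scores, key=scores.get))   # greedy selection of the highest-probability option
```

Taking the argmax over the option scores mirrors "we use the greatest value as the selected answer"; because decoding is greedy, the result is deterministic for a fixed option ordering, which is exactly the setting in which order dependence can be measured.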