KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Authors: Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which exacerbate as information popularity decreases.
Researcher Affiliation | Collaboration | Microsoft Research, University of Illinois Urbana-Champaign, Stanford University
Pseudocode | No | The paper includes 'PROMPTING TEMPLATES' in Appendix D (e.g., '[TEMPLATE 1 ALL-BOOKS]'), which describe structured steps, but these are not explicitly labeled as 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | While the paper states 'We open source our contributions', the link it provides (https://huggingface.co/datasets/microsoft/kitab) points to the KITAB dataset and is not stated to contain the source code for the methodology.
Open Datasets | Yes | We present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. https://huggingface.co/datasets/microsoft/kitab (a loading sketch follows this table).
Dataset Splits | No | The paper introduces KITAB as an evaluation dataset. It describes different experimental conditions and query types (e.g., queries with one book constraint versus two book constraints), but it does not specify traditional train/validation/test splits, since KITAB is used only to evaluate pre-trained LLMs.
Hardware Specification | No | The paper evaluates proprietary models (GPT4 and GPT3.5) and does not provide specific details about the hardware used to run these models or the experiments.
Software Dependencies | No | The paper mentions using 'Azure Cognitive Services Language API', 'Geonames', and 'fuzzy matching' (with a GitHub link) for data collection and cleaning, but it does not provide version numbers for these components or for any other libraries used in the experiments (an illustrative fuzzy-matching sketch follows this table).
Experiment Setup | Yes | All experiments were done with temperature 0. Maximum token length = 1000 for [TEMPLATE 1 ALL-BOOKS] and [TEMPLATE 2B WITH-CONTEXT] (200 for SINGLE-ITEM), 400 for [TEMPLATE 2A NO-CONTEXT], and 3000 for [TEMPLATE 3 SELF-CONTEXT], as detailed in Appendix D (translated into an API-call sketch below).
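
For readers who want to verify the Open Datasets row, the snippet below is a minimal sketch of loading KITAB from the Hugging Face Hub with the `datasets` library. Only the dataset ID comes from the paper; the config-discovery step is a convenience added here, not part of the authors' tooling.

```python
# Minimal sketch: pulling KITAB from the Hugging Face Hub. The dataset ID
# comes from the paper's link; config names are discovered at runtime
# rather than assumed.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("microsoft/kitab")  # e.g. ['default'] if unsplit
kitab = load_dataset("microsoft/kitab", configs[0])
print(kitab)  # shows the available splits and their columns
```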
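The fuzzy matching mentioned in the Software Dependencies row is referenced in the paper only via a GitHub link, so the following is an illustrative stand-in using the `thefuzz` library; the `token_set_ratio` scorer and the threshold of 80 are assumptions, not the paper's settings.

```python
# Illustrative fuzzy title matching, as might be used to check whether a
# model-returned book corresponds to a ground-truth title. thefuzz and the
# threshold below are stand-ins, not the paper's exact implementation.
from thefuzz import fuzz

def titles_match(candidate: str, reference: str, threshold: int = 80) -> bool:
    """True when two titles are similar enough to be treated as the same book."""
    return fuzz.token_set_ratio(candidate.lower(), reference.lower()) >= threshold

print(titles_match("The Old Man and The Sea", "Old Man & the Sea"))  # True
```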
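The Experiment Setup row maps directly onto decoding parameters. Below is a minimal sketch using the OpenAI Python SDK with the temperature and per-template token limits reported in Appendix D; the model identifier, prompt handling, and dictionary keys are placeholders, not the authors' code.

```python
# Sketch of the reported decoding setup: temperature 0 and per-template
# maximum token lengths (Appendix D). Model name and prompt are placeholders.
from openai import OpenAI

MAX_TOKENS = {
    "TEMPLATE_1_ALL_BOOKS": 1000,     # 200 for the SINGLE-ITEM variant
    "TEMPLATE_2A_NO_CONTEXT": 400,
    "TEMPLATE_2B_WITH_CONTEXT": 1000,
    "TEMPLATE_3_SELF_CONTEXT": 3000,
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_query(prompt: str, template: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                    # deterministic decoding, as reported
        max_tokens=MAX_TOKENS[template],  # template-specific limit
    )
    return response.choices[0].message.content
```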