KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Authors: Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which exacerbate as information popularity decreases.
Researcher Affiliation | Collaboration | Microsoft Research, University of Illinois Urbana-Champaign, Stanford University
Pseudocode | No | The paper includes 'PROMPTING TEMPLATES' in Appendix D (e.g., '[TEMPLATE 1 ALL-BOOKS]'), which describe structured steps, but these are not explicitly labeled as 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | While the paper states 'We open source our contributions', the link it provides (https://huggingface.co/datasets/microsoft/kitab) points to the KITAB dataset and is not stated to contain the source code for the methodology.
Open Datasets | Yes | We present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. https://huggingface.co/datasets/microsoft/kitab (a loading sketch follows this table).
Dataset Splits | No | The paper introduces KITAB as an evaluation dataset. It describes different experimental conditions and query types (e.g., queries with one book constraint versus two book constraints), but it does not specify traditional train/validation/test splits, since KITAB is used only to evaluate pre-trained LLMs.
Hardware Specification | No | The paper evaluates proprietary models (GPT4 and GPT3.5) and does not provide specific details about the hardware used to run these models or the experiments.
Software Dependencies | No | The paper mentions using 'Azure Cognitive Services Language API', 'Geonames', and 'fuzzy matching' (with a GitHub link) for data collection and cleaning, but it does not provide version numbers for these components or for any other libraries used in the experiments (an illustrative fuzzy-matching sketch follows this table).
Experiment Setup | Yes | All experiments were done with temperature 0. Maximum token length = 1000 for [TEMPLATE 1 ALL-BOOKS] and [TEMPLATE 2B WITH-CONTEXT] (200 for SINGLE-ITEM), 400 for [TEMPLATE 2A NO-CONTEXT], and 3000 for [TEMPLATE 3 SELF-CONTEXT], as detailed in Appendix D (translated into an API-call sketch below).
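
For readers who want to verify the Open Datasets row, the snippet below is a minimal sketch of loading KITAB from the Hugging Face Hub with the `datasets` library. Only the dataset ID comes from the paper; the config-discovery step is a convenience added here, not part of the authors' tooling.

```python
# Minimal sketch: pulling KITAB from the Hugging Face Hub. The dataset ID
# comes from the paper's link; config names are discovered at runtime
# rather than assumed.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("microsoft/kitab")  # e.g. ['default'] if unsplit
kitab = load_dataset("microsoft/kitab", configs[0])
print(kitab)  # shows the available splits and their columns
```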
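The fuzzy matching mentioned in the Software Dependencies row is referenced in the paper only via a GitHub link, so the following is an illustrative stand-in using the `thefuzz` library; the `token_set_ratio` scorer and the threshold of 80 are assumptions, not the paper's settings.

```python
# Illustrative fuzzy title matching, as might be used to check whether a
# model-returned book corresponds to a ground-truth title. thefuzz and the
# threshold below are stand-ins, not the paper's exact implementation.
from thefuzz import fuzz

def titles_match(candidate: str, reference: str, threshold: int = 80) -> bool:
    """True when two titles are similar enough to be treated as the same book."""
    return fuzz.token_set_ratio(candidate.lower(), reference.lower()) >= threshold

print(titles_match("The Old Man and The Sea", "Old Man & the Sea"))  # True
```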
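The Experiment Setup row maps directly onto decoding parameters. Below is a minimal sketch using the OpenAI Python SDK with the temperature and per-template token limits reported in Appendix D; the model identifier, prompt handling, and dictionary keys are placeholders, not the authors' code.

```python
# Sketch of the reported decoding setup: temperature 0 and per-template
# maximum token lengths (Appendix D). Model name and prompt are placeholders.
from openai import OpenAI

MAX_TOKENS = {
    "TEMPLATE_1_ALL_BOOKS": 1000,     # 200 for the SINGLE-ITEM variant
    "TEMPLATE_2A_NO_CONTEXT": 400,
    "TEMPLATE_2B_WITH_CONTEXT": 1000,
    "TEMPLATE_3_SELF_CONTEXT": 3000,
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_query(prompt: str, template: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                    # deterministic decoding, as reported
        max_tokens=MAX_TOKENS[template],  # template-specific limit
    )
    return response.choices[0].message.content
```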