COLLIE: Systematic Construction of Constrained Text Generation Tasks
Authors: Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, Karthik R Narasimhan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings. |
| Researcher Affiliation | Academia | Department of Computer Science, Princeton University {shunyuy, hc22, hjwang, runzhey, karthikn}@princeton.edu |
| Pseudocode | Yes | Figure 2: Example COLLIE code for a simple number of words constraint. (A hedged sketch of such a constraint check is given below the table.) |
| Open Source Code | Yes | Project site with code and data: https://collie-benchmark.github.io. |
| Open Datasets | Yes | We extract constraint targets from three distinct data sources: Wikipedia (Wiki) (Foundation, 2022), Common Crawl News (CC-News) (Hamborg et al., 2017), and the Project Gutenberg Corpus (Guten) (Brooke et al., 2015). |
| Dataset Splits | No | The paper evaluates pre-trained language models in a zero-shot setting on the COLLIE-v1 dataset; since no models are trained, it does not describe train/validation dataset splits. |
| Hardware Specification | No | The paper states 'All experiments were run in July, 2023.' but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python version, library versions) used for conducting the experiments. |
| Experiment Setup | Yes | Our main experiments in this paper focus on a zero-shot prompting setup... By default, we use a sampling temperature of 0.7, and sample multiple trials (20 for GPT/PaLM, 5 for Alpaca/Vicuna). (A sketch of this multi-trial sampling loop follows the table.) |
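
The Pseudocode row refers to the paper's Figure 2, which constructs a word-count constraint in COLLIE's Python framework. The sketch below only illustrates the idea; the class and method names (`WordCountConstraint`, `check`, `render_instruction`) are hypothetical and are not the actual COLLIE API, which is documented at https://collie-benchmark.github.io.

```python
from dataclasses import dataclass


@dataclass
class WordCountConstraint:
    """Require a generated text to contain exactly `target` words.

    Hypothetical class for illustration; not the real COLLIE interface.
    """
    target: int

    def check(self, text: str) -> bool:
        # Whitespace tokenization is a simplification of however the
        # benchmark actually counts words.
        return len(text.split()) == self.target

    def render_instruction(self) -> str:
        # Natural-language rendering of the constraint, analogous to the
        # prompts the benchmark presents to models.
        return f"Please write a sentence with exactly {self.target} words."


constraint = WordCountConstraint(target=7)
print(constraint.render_instruction())
print(constraint.check("This sentence has exactly seven words total."))  # True
```

Keeping constraint checking as a deterministic, standalone function separates verification from generation, which is the pattern a programmatic constraint framework like COLLIE suggests.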
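
The Experiment Setup row describes zero-shot prompting with temperature 0.7 and several sampled trials per prompt. The sketch below mirrors that protocol under stated assumptions: `query_model` and `check_constraint` are placeholders for a model client and a constraint checker, the model keys are illustrative, and only the temperature and trial counts come from the quoted setup; this is not the authors' evaluation harness.

```python
from typing import Callable, List

TEMPERATURE = 0.7  # sampling temperature reported in the paper
# Trial counts quoted from the setup; the model-name keys are illustrative.
TRIALS_PER_PROMPT = {"gpt": 20, "palm": 20, "alpaca": 5, "vicuna": 5}


def run_trials(
    prompt: str,
    model_name: str,
    query_model: Callable[[str, str, float], str],
    check_constraint: Callable[[str], bool],
) -> float:
    """Sample several completions and return the pass rate on one constraint."""
    n_trials = TRIALS_PER_PROMPT.get(model_name, 5)
    outputs: List[str] = [
        query_model(model_name, prompt, TEMPERATURE) for _ in range(n_trials)
    ]
    passed = sum(check_constraint(text) for text in outputs)
    return passed / n_trials
```

Averaging these per-prompt pass rates across the benchmark would give a constraint-satisfaction rate that can be compared across models.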