Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
COLLIE: Systematic Construction of Constrained Text Generation Tasks
Authors: Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, Karthik R Narasimhan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings. |
| Researcher Affiliation | Academia | Department of Computer Science, Princeton University |
| Pseudocode | Yes | Figure 2: Example COLLIE code for a simple number of words constraint. |
| Open Source Code | Yes | Project site with code and data: https://collie-benchmark.github.io. |
| Open Datasets | Yes | We extract constraint targets from three distinct data sources: Wikipedia (Wiki) (Foundation, 2022), Common Crawl News (CC-News) (Hamborg et al., 2017), and the Project Gutenberg Corpus (Guten) (Brooke et al., 2015). |
| Dataset Splits | No | The paper evaluates pre-trained language models in a zero-shot setting on the COLLIE-v1 dataset. Therefore, it does not describe train/validation dataset splits as part of its experimental setup for training models. |
| Hardware Specification | No | The paper states 'All experiments were run in July, 2023.' but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python version, library versions) used for conducting the experiments. |
| Experiment Setup | Yes | Our main experiments in this paper focus on a zero-shot prompting setup... By default, we use a sampling temperature of 0.7, and sample multiple trials (20 for GPT/PaLM, 5 for Alpaca/Vicuna). |
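
The Pseudocode and Experiment Setup rows describe a number-of-words constraint and a multi-trial sampling protocol. The sketch below illustrates how such a constraint might be checked over repeated generations. It is not the COLLIE API: `satisfies_word_count`, `satisfaction_rate`, and `make_stub_generator` are hypothetical names, whitespace splitting is an assumed definition of a "word", and the stub generator stands in for a real model sampled at temperature 0.7.

```python
from typing import Callable, List


def satisfies_word_count(text: str, target: int) -> bool:
    """Return True if `text` has exactly `target` whitespace-separated words.

    Assumption: a "word" is any whitespace-delimited token.
    """
    return len(text.split()) == target


def satisfaction_rate(
    generate: Callable[[str], str], prompt: str, target: int, trials: int
) -> float:
    """Fraction of `trials` independent generations that meet the constraint."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    hits = sum(
        satisfies_word_count(generate(prompt), target) for _ in range(trials)
    )
    return hits / trials


def make_stub_generator(outputs: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a sampled model: cycles through canned outputs."""
    state = {"i": 0}

    def generate(prompt: str) -> str:
        out = outputs[state["i"] % len(outputs)]
        state["i"] += 1
        return out

    return generate
```

Under the setup quoted in the table, `trials` would be 20 for GPT/PaLM and 5 for Alpaca/Vicuna, with `generate` wrapping whichever model client is in use.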