reproducibilityindex.ai

COLLIE: Systematic Construction of Constrained Text Generation Tasks

Authors: Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, Karthik R Narasimhan

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings.
Researcher Affiliation	Academia	Department of Computer Science, Princeton University {shunyuy, hc22, hjwang, runzhey, karthikn}@princeton.edu
Pseudocode	Yes	Figure 2: Example COLLIE code for a simple number of words constraint.
Open Source Code	Yes	Project site with code and data: https://collie-benchmark.github.io.
Open Datasets	Yes	We extract constraint targets from three distinct data sources: Wikipedia (Wiki) (Foundation, 2022), Common Crawl News (CC-News) (Hamborg et al., 2017), and the Project Gutenberg Corpus (Guten) (Brooke et al., 2015).
Dataset Splits	No	The paper evaluates pre-trained language models in a zero-shot setting on the COLLIE-v1 dataset. Therefore, it does not describe train/validation dataset splits as part of its experimental setup for training models.
Hardware Specification	No	The paper states 'All experiments were run in July, 2023.' but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running experiments.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers (e.g., Python version, library versions) used for conducting the experiments.
Experiment Setup	Yes	Our main experiments in this paper focus on a zero-shot prompting setup... By default, we use a sampling temperature of 0.7, and sample multiple trials (20 for GPT/PaLM, 5 for Alpaca/Vicuna).
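
The Experiment Setup row can be illustrated with a minimal Python sketch of a zero-shot sampling loop. It is not the COLLIE library's API and not the authors' evaluation code. Only the temperature (0.7) and the trial counts (20 for GPT/PaLM, 5 for Alpaca/Vicuna) come from the row above. The names `TRIALS`, `word_count_ok`, `pass_rate`, and `stub_generate` are hypothetical, the exact-word-count check is an assumed stand-in for the "number of words" constraint mentioned in the Pseudocode row, and the pass-rate summary is an assumed way to aggregate trials.

```python
import random

# Values quoted in the Experiment Setup row.
TEMPERATURE = 0.7
TRIALS = {"gpt": 20, "palm": 20, "alpaca": 5, "vicuna": 5}


def word_count_ok(text: str, target: int) -> bool:
    """Hypothetical checker: True if the text has exactly `target` words."""
    return len(text.split()) == target


def pass_rate(generate, prompt: str, model_family: str, target: int) -> float:
    """Sample the configured number of trials and return the fraction that pass.

    `generate` is any callable taking (prompt, temperature) and returning a string;
    it stands in for a real model call, which this sketch does not make.
    """
    n = TRIALS[model_family]
    outputs = [generate(prompt, TEMPERATURE) for _ in range(n)]
    return sum(word_count_ok(out, target) for out in outputs) / n


def stub_generate(prompt: str, temperature: float, _rng=random.Random(0)) -> str:
    """Stand-in generator: returns 4 to 6 placeholder words, ignoring the prompt."""
    return " ".join(["word"] * _rng.choice([4, 5, 6]))


if __name__ == "__main__":
    prompt = "Please write a sentence with exactly 5 words."
    rate = pass_rate(stub_generate, prompt, "alpaca", target=5)
    print(f"pass rate over {TRIALS['alpaca']} trials: {rate:.2f}")
```

Replacing `stub_generate` with a real model client is what the hardware and software rows would need to document, since the summary above reports that neither is specified in the paper.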