Goal Driven Discovery of Distributional Differences via Language Descriptions

Authors: Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, Jacob Steinhardt

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To quantitatively evaluate its performance, we 1) build a diagnostic benchmark, SYND5, to test whether it can recover known differences between two synthetic corpora, and 2) contribute a meta-dataset, OPEND5, aggregating 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health.
Researcher Affiliation | Academia | University of California, Berkeley, EECS Department. Email: ruiqizhong@berkeley.edu
Pseudocode | No | The paper describes algorithms and a pipeline (Figure 5) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured code-like steps.
Open Source Code | Yes | Our code is released at https://github.com/ruiqi-zhong/D5 and our code to download OPEND5 is released at https://github.com/petezh/OpenD5.
Open Datasets | Yes | Our code to download OPEND5 is released at https://github.com/petezh/OpenD5.
Dataset Splits | Yes | We use 50% of each corpus as the exploration split and 50% as the validation split.
Hardware Specification | Yes | We ran the Flan-T5 based validator for 2 hours on one 80GB A100 GPU.
Software Dependencies | No | The paper mentions several language models used (e.g., gpt-3, Flan-T5, gpt-4, Claude-v1.3) and notes that Flan-T5 was fine-tuned, citing Chung et al. (2022). It also mentions the NLTK package (Bird et al., 2009). However, specific version numbers for these software dependencies are not consistently provided (e.g., the exact versions of gpt-3 or NLTK).
Experiment Setup | Yes | We prompt gpt-3 (Ouyang et al., 2022) to propose hypotheses. Denoting the exploration split of Corpus A/B as D_A^exp / D_B^exp, we construct the prompt by concatenating a few random samples from D_A^exp and D_B^exp, the exploration goal, and an instruction to output a list of hypotheses. Figure 3 (left) depicts an example of the resulting prompt, together with a typical language model output. ... We continue sampling hypotheses with different prompts until obtaining a set of 60 hypotheses ... rule out the hypotheses with p greater than 0.001.
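
The Dataset Splits row above quotes a 50/50 exploration/validation split of each corpus. Below is a minimal sketch of that split, assuming each corpus is an in-memory list of text samples; the function name, shuffling, and seed are illustrative assumptions rather than details taken from the released code.

```python
import random

def split_corpus(samples, seed=0):
    """Split one corpus 50/50 into (exploration, validation), per the paper's stated setup.

    The shuffle and fixed seed are illustrative choices, not taken from the released code.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Toy stand-ins for Corpus A and Corpus B.
corpus_a = ["a text sample 1", "a text sample 2", "a text sample 3", "a text sample 4"]
corpus_b = ["b text sample 1", "b text sample 2", "b text sample 3", "b text sample 4"]

exp_a, val_a = split_corpus(corpus_a)
exp_b, val_b = split_corpus(corpus_b)
```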
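
The Experiment Setup row describes building the proposer prompt from a few random exploration-split samples of each corpus plus the goal and an instruction to list hypotheses, then sampling with different prompts until 60 hypotheses are collected; hypotheses with p greater than 0.001 on the validation split are later ruled out. The sketch below illustrates that loop under stated assumptions: the prompt wording, the `complete` callable standing in for an LLM API, and the "- " list format are hypothetical, not the released pipeline.

```python
import random

PROMPT_TEMPLATE = """\
Samples from Corpus A:
{a_samples}

Samples from Corpus B:
{b_samples}

Exploration goal: {goal}

Propose a list of hypotheses describing how Corpus A differs from Corpus B.
Output one hypothesis per line, prefixed with "- ".
"""

def build_prompt(exp_a, exp_b, goal, n_per_corpus=5, rng=None):
    """Concatenate a few random exploration-split samples from each corpus with the
    goal and a list-of-hypotheses instruction (template wording is an assumption)."""
    rng = rng or random.Random()
    a = "\n".join(rng.sample(exp_a, min(n_per_corpus, len(exp_a))))
    b = "\n".join(rng.sample(exp_b, min(n_per_corpus, len(exp_b))))
    return PROMPT_TEMPLATE.format(a_samples=a, b_samples=b, goal=goal)

def propose_hypotheses(exp_a, exp_b, goal, complete, target=60):
    """Keep sampling with freshly built prompts until `target` distinct hypotheses
    are collected; `complete` is any prompt -> completion-text function."""
    hypotheses = set()
    while len(hypotheses) < target:
        output = complete(build_prompt(exp_a, exp_b, goal))
        for line in output.splitlines():
            if line.strip().startswith("- "):
                hypotheses.add(line.strip()[2:].strip())
    return sorted(hypotheses)[:target]

# Downstream (not sketched here): each hypothesis is scored on the validation split
# and those with p greater than 0.001 are ruled out.
```

The released D5 repository may differ in its actual prompts, model calls, and deduplication; this sketch only mirrors the concatenate-samples-and-ask structure described in the quote above.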