Goal Driven Discovery of Distributional Differences via Language Descriptions

Authors: Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, Jacob Steinhardt

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To quantitatively evaluate its performance, we 1) build a diagnostic benchmark, SYND5, to test whether it can recover known differences between two synthetic corpora, and 2) contribute a meta-dataset, OPEND5, aggregating 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health.
Researcher Affiliation | Academia | University of California, Berkeley, EECS Department. Email: ruiqizhong@berkeley.edu
Pseudocode | No | The paper describes algorithms and a pipeline (Figure 5) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured code-like steps.
Open Source Code | Yes | Our code is released at https://github.com/ruiqi-zhong/D5 and our code to download OPEND5 is released at https://github.com/petezh/OpenD5.
Open Datasets | Yes | Our code to download OPEND5 is released at https://github.com/petezh/OpenD5.
Dataset Splits | Yes | We use 50% of each corpus as the exploration split and 50% as the validation split.
Hardware Specification | Yes | We ran the Flan-T5 based validator for 2 hours on one 80GB A100 GPU.
Software Dependencies | No | The paper mentions several language models used (e.g., gpt-3, Flan-T5, gpt-4, Claude-v1.3) and notes that Flan-T5 was fine-tuned, citing Chung et al. (2022). It also mentions the NLTK package (Bird et al., 2009). However, specific version numbers for these software dependencies are not consistently provided (e.g., the exact versions of gpt-3 or NLTK).
Experiment Setup | Yes | We prompt gpt-3 (Ouyang et al., 2022) to propose hypotheses. Denoting the exploration split of Corpus A/B as D_A^exp / D_B^exp, we construct the prompt by concatenating a few random samples from D_A^exp and D_B^exp, the exploration goal, and an instruction to output a list of hypotheses. Figure 3 (left) depicts an example of the resulting prompt, together with a typical language model output. ... We continue sampling hypotheses with different prompts until obtaining a set of 60 hypotheses ... rule out the hypotheses with p greater than 0.001.
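
The Dataset Splits row above quotes a 50/50 exploration/validation split of each corpus. Below is a minimal sketch of that split, assuming each corpus is an in-memory list of text samples; the function name, shuffling, and seed are illustrative assumptions rather than details taken from the released code.

```python
import random

def split_corpus(samples, seed=0):
    """Split one corpus 50/50 into (exploration, validation), per the paper's stated setup.

    The shuffle and fixed seed are illustrative choices, not taken from the released code.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Toy stand-ins for Corpus A and Corpus B.
corpus_a = ["a text sample 1", "a text sample 2", "a text sample 3", "a text sample 4"]
corpus_b = ["b text sample 1", "b text sample 2", "b text sample 3", "b text sample 4"]

exp_a, val_a = split_corpus(corpus_a)
exp_b, val_b = split_corpus(corpus_b)
```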
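
The Experiment Setup row describes building the proposer prompt from a few random exploration-split samples of each corpus plus the goal and an instruction to list hypotheses, then sampling with different prompts until 60 hypotheses are collected; hypotheses with p greater than 0.001 on the validation split are later ruled out. The sketch below illustrates that loop under stated assumptions: the prompt wording, the `complete` callable standing in for an LLM API, and the "- " list format are hypothetical, not the released pipeline.

```python
import random

PROMPT_TEMPLATE = """\
Samples from Corpus A:
{a_samples}

Samples from Corpus B:
{b_samples}

Exploration goal: {goal}

Propose a list of hypotheses describing how Corpus A differs from Corpus B.
Output one hypothesis per line, prefixed with "- ".
"""

def build_prompt(exp_a, exp_b, goal, n_per_corpus=5, rng=None):
    """Concatenate a few random exploration-split samples from each corpus with the
    goal and a list-of-hypotheses instruction (template wording is an assumption)."""
    rng = rng or random.Random()
    a = "\n".join(rng.sample(exp_a, min(n_per_corpus, len(exp_a))))
    b = "\n".join(rng.sample(exp_b, min(n_per_corpus, len(exp_b))))
    return PROMPT_TEMPLATE.format(a_samples=a, b_samples=b, goal=goal)

def propose_hypotheses(exp_a, exp_b, goal, complete, target=60):
    """Keep sampling with freshly built prompts until `target` distinct hypotheses
    are collected; `complete` is any prompt -> completion-text function."""
    hypotheses = set()
    while len(hypotheses) < target:
        output = complete(build_prompt(exp_a, exp_b, goal))
        for line in output.splitlines():
            if line.strip().startswith("- "):
                hypotheses.add(line.strip()[2:].strip())
    return sorted(hypotheses)[:target]

# Downstream (not sketched here): each hypothesis is scored on the validation split
# and those with p greater than 0.001 are ruled out.
```

The released D5 repository may differ in its actual prompts, model calls, and deduplication; this sketch only mirrors the concatenate-samples-and-ask structure described in the quote above.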