Describing Differences between Text Distributions with Natural Language

Authors: Ruiqi Zhong, Charlie Snell, Dan Klein, Jacob Steinhardt

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a benchmark of 54 real-world binary classification tasks, while GPT-3 Curie (13B) only generates a description similar to human annotation 7% of the time, the performance reaches 61% with fine-tuning and reranking, and our best system using GPT-3 Davinci (175B) reaches 76%.
Researcher Affiliation | Academia | Ruiqi Zhong, Charlie Snell, Dan Klein, Jacob Steinhardt (Computer Science Division, University of California, Berkeley). Correspondence to: Ruiqi Zhong <ruiqi-zhong@berkeley.edu>.
Pseudocode | No | The paper describes its methods and framework through textual descriptions and diagrams (e.g., Figure 2), but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code.
Open Source Code | Yes | We release our code and data with the following link https://github.com/ruiqi-zhong/DescribeDistributionalDifferences.
Open Datasets | Yes | The evaluation set of Zhong et al. (2021) aggregated 54 diverse binary text classification tasks, each annotated with one or multiple natural language descriptions s for the positive class. These tasks include topic classification, grammaticality classification, stance classification, etc. ... The 54 binary tasks are from Maas et al. (2011), Yin et al. (2019), Barbieri et al. (2020), Zhang et al. (2015), Yin et al. (2019), Warstadt et al. (2018), Almeida et al. (2013), Pang & Lee (2004), Li & Roth (2002), Mihaylova et al. (2019), and an abstract classification dataset. (A hypothetical task-format sketch appears after the table.)
Dataset Splits | No | The paper states that it benchmarks on 54 real-world binary classification tasks and uses the positive- and negative-class inputs as D1 and D0, and it mentions creating a fine-tuning dataset and sampling pairs for evaluating hypotheses. However, it does not provide explicit train/validation/test splits (percentages, sample counts, or references to predefined splits) for the 54 benchmark datasets, which would be needed to reproduce the exact data partitioning used in the experiments.
Hardware Specification | No | The paper mentions that experiments used GPT-3 Davinci, GPT-3 Curie, and T5 models, and acknowledges 'the TPU Research Cloud (TRC) program for providing computational resources,' but it does not specify concrete hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or specific TPU versions (e.g., TPU v2).
Software Dependencies | No | The paper mentions the use of specific large language models like GPT-3 Davinci (175B), GPT-3 Curie (13B), Unified QA (based on T5 11B), and RoBERTa-Large. However, it does not provide explicit software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8', 'CUDA 11.1') necessary for full reproducibility.
Experiment Setup | Yes | Proposer. ... We used batch size 20 and a small learning rate of 0.05 to prevent memorizing the target. We fine-tuned for two epochs... Verifier. ... We fine-tuned Unified QA on this dataset for 250 steps with batch size 32 and learning rate 5e-5.
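
The Experiment Setup row reports concrete fine-tuning hyperparameters for the proposer and the verifier. Below is a minimal sketch of how the verifier settings (250 steps, batch size 32, learning rate 5e-5) could be wired into a Hugging Face Trainer. This assumes a public UnifiedQA T5 checkpoint name and a single-device setup rather than the authors' original TPU configuration; the proposer values are recorded only as a plain dictionary, since that model was fine-tuned through the OpenAI API and the paper does not give the exact API fields.

```python
# Minimal sketch of the reported fine-tuning hyperparameters. The verifier part
# assumes a Hugging Face Trainer setup and a public UnifiedQA T5 checkpoint;
# the authors' original runs used TPUs, so this is illustrative, not their code.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Proposer (GPT-3 Curie, fine-tuned through the OpenAI API): the paper reports
# batch size 20, a small learning rate of 0.05, and two epochs. Recorded here
# as plain values only; the exact API parameter names are not given in the paper.
PROPOSER_HPARAMS = {"batch_size": 20, "learning_rate": 0.05, "n_epochs": 2}

# Verifier (Unified QA, T5-based): 250 steps, batch size 32, learning rate 5e-5.
VERIFIER_CHECKPOINT = "allenai/unifiedqa-t5-11b"  # assumed Hub checkpoint name


def build_verifier_trainer(train_dataset):
    """Configure a Trainer with the verifier hyperparameters reported in the paper."""
    tokenizer = AutoTokenizer.from_pretrained(VERIFIER_CHECKPOINT)
    model = AutoModelForSeq2SeqLM.from_pretrained(VERIFIER_CHECKPOINT)
    args = Seq2SeqTrainingArguments(
        output_dir="verifier_ckpt",
        max_steps=250,                    # fine-tune for 250 steps
        per_device_train_batch_size=32,   # batch size 32
        learning_rate=5e-5,               # learning rate 5e-5
    )
    return Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    )
```

Calling build_verifier_trainer(dataset).train() would run the 250-step fine-tune; the dataset argument is a placeholder for tokenized hypothesis-verification examples.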
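
For context on the benchmark structure described in the Open Datasets row, the following is a sketch of what one of the 54 task records could look like, assuming each task pairs positive and negative samples (the D1/D0 inputs) with one or more human-written descriptions of the positive class. The field names and example strings are hypothetical and do not reflect the released data schema.

```python
# Hypothetical record for one of the 54 binary tasks; field names and strings
# are illustrative only and are not taken from the released dataset.
example_task = {
    "task": "movie review sentiment (illustrative)",
    "positive_samples": [   # D1: inputs drawn from the positive class
        "I loved every minute of this film.",
        "A moving, beautifully acted drama.",
    ],
    "negative_samples": [   # D0: inputs drawn from the negative class
        "The plot was incoherent and dull.",
        "I walked out halfway through.",
    ],
    "descriptions": [       # natural language description(s) s of the positive class
        "is a positive movie review",
    ],
}
```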