Describing Differences between Text Distributions with Natural Language
Authors: Ruiqi Zhong, Charlie Snell, Dan Klein, Jacob Steinhardt
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a benchmark of 54 real-world binary classification tasks, while GPT-3 Curie (13B) only generates a description similar to human annotation 7% of the time, the performance reaches 61% with fine-tuning and reranking, and our best system using GPT-3 Davinci (175B) reaches 76%. |
| Researcher Affiliation | Academia | Ruiqi Zhong 1 Charlie Snell 1 Dan Klein 1 Jacob Steinhardt 1 1Computer Science Division, University of California, Berkeley. Correspondence to: Ruiqi Zhong <ruiqi-zhong@berkeley.edu>. |
| Pseudocode | No | The paper describes its methods and framework through textual descriptions and diagrams (e.g., Figure 2), but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps formatted like code. |
| Open Source Code | Yes | We release our code and data with the following link https://github.com/ruiqi-zhong/DescribeDistributionalDifferences. |
| Open Datasets | Yes | The evaluation set of Zhong et al. (2021) aggregated 54 diverse binary text classification tasks, each annotated with one or multiple natural language descriptions s for the positive class. These tasks include topic classification, grammaticality classification, stance classification, etc. ... The 54 binary tasks are from Maas et al. (2011), Yin et al. (2019), Barbieri et al. (2020), Zhang et al. (2015), Warstadt et al. (2018), Almeida et al. (2013), Pang & Lee (2004), Li & Roth (2002), Mihaylova et al. (2019), and an abstract classification dataset. |
| Dataset Splits | No | The paper benchmarks on 54 real-world binary classification tasks, using the positive- and negative-class inputs as D1 and D0 (see the task-format sketch after the table), and mentions constructing a fine-tuning dataset and sampling pairs for hypothesis evaluation. However, it does not provide explicit train/validation/test splits (percentages, sample counts, or references to predefined splits) for the 54 benchmark datasets, which would be needed to reproduce the exact data partitioning. |
| Hardware Specification | No | The paper mentions that experiments used GPT-3 Davinci, GPT-3 Curie, and T5 models, and acknowledges 'the TPU Research Cloud (TRC) program for providing computational resources,' but it does not specify concrete hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or specific TPU versions (e.g., TPU v2). |
| Software Dependencies | No | The paper mentions the use of specific large language models like GPT-3 Davinci (175B), GPT-3 Curie (13B), Unified QA (based on T5 11B), and RoBERTa-Large. However, it does not provide explicit software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8', 'CUDA 11.1') necessary for full reproducibility. |
| Experiment Setup | Yes | Proposer. ...We used batch size 20 and a small learning rate of 0.05 to prevent memorizing the target. We fine-tuned for two epochs... Verifier. ...We fine-tuned Unified QA on this dataset for 250 steps with batch size 32 and learning rate 5e-5. (See the configuration sketch after the table.) |
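
For reference, the task framing quoted above (positive-class inputs as D1, negative-class inputs as D0, and one or more human descriptions s of the positive class) can be captured in a minimal record type. This is an illustrative sketch only; the field names and the toy example are assumptions, not the schema of the released repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BinaryTask:
    """Minimal sketch of one of the 54 benchmark tasks.

    Field names are illustrative assumptions: the paper treats
    positive-class inputs as D1, negative-class inputs as D0, and
    annotates each task with one or more natural language
    descriptions s of the positive class.
    """
    name: str
    d1: List[str]            # positive-class text samples (D1)
    d0: List[str]            # negative-class text samples (D0)
    descriptions: List[str]  # human-written descriptions s of the positive class


# Toy example with invented content, purely to show the shape of a record.
example = BinaryTask(
    name="toy-sentiment",
    d1=["I loved this movie.", "Absolutely wonderful."],
    d0=["Terrible plot.", "I want my money back."],
    descriptions=["expresses a positive sentiment about a movie"],
)
```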
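
The hyperparameters quoted in the Experiment Setup row can likewise be collected into a small configuration sketch. The numeric values below are the ones reported in the paper; the model identifiers and field names are assumptions made for illustration, and the released repository remains the authoritative reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinetuneConfig:
    """Reported fine-tuning hyperparameters (model identifiers are illustrative)."""
    model: str
    batch_size: int
    learning_rate: float
    epochs: Optional[int] = None     # proposer is trained for a fixed number of epochs
    max_steps: Optional[int] = None  # verifier is trained for a fixed number of steps


# Proposer: GPT-3 Curie fine-tuned for two epochs with batch size 20 and a small
# learning rate of 0.05, which the paper says was chosen to prevent memorizing the target.
PROPOSER = FinetuneConfig(
    model="gpt3-curie",  # placeholder identifier for the GPT-3 Curie proposer
    batch_size=20,
    learning_rate=0.05,
    epochs=2,
)

# Verifier: Unified QA (T5 11B) fine-tuned for 250 steps with batch size 32 and lr 5e-5.
VERIFIER = FinetuneConfig(
    model="allenai/unifiedqa-t5-11b",  # assumed checkpoint name for Unified QA
    batch_size=32,
    learning_rate=5e-5,
    max_steps=250,
)
```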