Gen-Z: Generative Zero-Shot Text Classification with Contextualized Label Descriptions
Authors: Sachin Kumar, Chan Young Park, Yulia Tsvetkov
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate GEN-Z by conducting experiments with six open-source language model families (GPT2, OPT, Pythia, GPT-J, Llama, and Llama2) with models ranging from 125M to 13B parameters on 19 semantic text classification tasks (comprising sentiment, topic, hate speech, and emotion classification). |
| Researcher Affiliation | Collaboration | Sachin Kumar Allen Institute for AI Seattle, WA sachink@allenai.org; Chan Young Park Carnegie Mellon University Pittsburgh, PA chanyoun@cs.cmu.edu; Yulia Tsvetkov University of Washington Seattle, WA yuliats@cs.washington.edu |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | We provide the code to reproduce our results at: https://github.com/Sachin19/generative-classification/ |
| Open Datasets | Yes | We evaluate on 18 text classification datasets encompassing diverse tasks, domains, and difficulty levels. ... Table 4 in the appendix summarizes the details of each dataset we use. |
| Dataset Splits | No | The paper states, 'We measure performance using publicly available validation or test sets, without using the training data at all.' However, it does not specify exact split percentages or sample counts for these validation sets, which are necessary for full reproducibility. |
| Hardware Specification | No | The paper mentions 'high computational requirements' and 'consumer hardware' but does not specify any particular GPU or CPU models, memory sizes, or other specific hardware configurations used for the experiments. |
| Software Dependencies | No | The paper mentions using various language model families (e.g., GPT2, OPT, Pythia, Llama) but does not provide specific version numbers for any underlying software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) that would be needed for reproducibility. |
| Experiment Setup | Yes | For each task, we manually write one minimal label description per label using a template (see complete list in Table 5). We then generate 20 paraphrases of each label description by querying ChatGPT. ... For each dataset, we run the evaluation ten times where in each run we subsample n (1 ≤ n ≤ 10) paraphrases from this set. We evaluate all methods using macro-F1 score and report mean and standard deviation across these runs. |
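The evaluation protocol quoted above (repeated runs over random paraphrase subsamples, reporting mean and standard deviation of macro-F1) can be sketched as follows. This is a minimal illustration, not the authors' code: `classify` is a hypothetical stand-in for the actual GEN-Z scoring procedure, and `macro_f1` is implemented inline to keep the sketch self-contained.

```python
# Hedged sketch of the evaluation loop described in the Experiment Setup row:
# run the classifier several times, each time subsampling n (1 <= n <= 10)
# label-description paraphrases, then report mean/std of macro-F1.
import random
import statistics

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def evaluate(classify, examples, paraphrases, n, runs=10, seed=0):
    """Mean and std of macro-F1 over `runs` random paraphrase subsamples.

    `classify(text, descriptions)` is assumed to return a predicted label;
    `examples` is a list of (text, gold_label) pairs.
    """
    rng = random.Random(seed)
    labels = sorted({y for _, y in examples})
    scores = []
    for _ in range(runs):
        subset = rng.sample(paraphrases, n)  # subsample n paraphrases
        preds = [classify(x, subset) for x, _ in examples]
        scores.append(macro_f1([y for _, y in examples], preds, labels))
    return statistics.mean(scores), statistics.stdev(scores)
```

Because the paper reports macro-F1 rather than accuracy, each class contributes equally to the score regardless of how imbalanced the dataset is, which the per-class loop above makes explicit.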