Gen-Z: Generative Zero-Shot Text Classification with Contextualized Label Descriptions

Authors: Sachin Kumar, Chan Young Park, Yulia Tsvetkov

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate GEN-Z by conducting experiments with six open-source language model families (GPT2, OPT, Pythia, GPT-J, Llama, and Llama2), with models ranging from 125M to 13B parameters, on 19 semantic text classification tasks (comprising sentiment, topic, hate speech, and emotion classification).
Researcher Affiliation | Collaboration | Sachin Kumar, Allen Institute for AI, Seattle, WA, sachink@allenai.org; Chan Young Park, Carnegie Mellon University, Pittsburgh, PA, chanyoun@cs.cmu.edu; Yulia Tsvetkov, University of Washington, Seattle, WA, yuliats@cs.washington.edu
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | We provide the code to reproduce our results at: https://github.com/Sachin19/generative-classification/
Open Datasets | Yes | We evaluate on 18 text classification datasets encompassing diverse tasks, domains, and difficulty levels. ... Table 4 in the appendix summarizes the details of each dataset we use.
Dataset Splits | No | The paper states, 'We measure performance using publicly available validation or test sets, without using the training data at all.' However, it does not specify exact split percentages or sample counts for these validation sets, which are necessary for full reproducibility.
Hardware Specification | No | The paper mentions 'high computational requirements' and 'consumer hardware' but does not specify any particular GPU or CPU models, memory sizes, or other hardware configurations used for the experiments.
Software Dependencies | No | The paper mentions using various language model families (e.g., GPT2, OPT, Pythia, Llama) but does not provide version numbers for the underlying software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) that would be needed for reproducibility.
Experiment Setup | Yes | For each task, we manually write one minimal label description per label using a template (see complete list in Table 5). We then generate 20 paraphrases of each label description by querying ChatGPT. ... For each dataset, we run the evaluation ten times, where in each run we subsample n (1 ≤ n ≤ 10) paraphrases from this set. We evaluate all methods using macro-F1 score and report mean and standard deviation across these runs.
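
The subsampled evaluation protocol quoted in the Experiment Setup row can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' released code: `classify` is a hypothetical stand-in for GEN-Z's label-description scoring, and the paraphrase sets, labels, and data are placeholders supplied by the caller.

```python
# Minimal sketch of the paraphrase-subsampling evaluation described above.
# Assumptions: `classify(text, descriptions)` is a hypothetical scoring
# function standing in for GEN-Z; paraphrases_per_label maps each label to
# its pool of (up to 20) paraphrased descriptions.
import random
from statistics import mean, stdev

from sklearn.metrics import f1_score


def evaluate_once(texts, gold, paraphrases_per_label, n, classify):
    """Subsample n paraphrased descriptions per label and score the dataset once."""
    sampled = {label: random.sample(paras, n)
               for label, paras in paraphrases_per_label.items()}
    preds = [classify(text, sampled) for text in texts]
    return f1_score(gold, preds, average="macro")


def evaluate(texts, gold, paraphrases_per_label, n, classify, runs=10):
    """Repeat the subsampled evaluation `runs` times; return mean and std macro-F1."""
    scores = [evaluate_once(texts, gold, paraphrases_per_label, n, classify)
              for _ in range(runs)]
    return mean(scores), stdev(scores)
```

Under this reading, reporting results for each n from 1 to 10 reproduces the mean and standard deviation curves described in the paper's setup.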