The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators

Authors: Tzu-Heng Huang, Catherine Cao, Vaishnavi Bhargava, Frederic Sala

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (4 experiments) | We study the capability of Alchemist empirically. Our goals are to validate the following claims:
- Cost Reduction and Improved Performance (Sec. 4.1): Alchemist can reduce cost by orders of magnitude while producing labels of similar or better accuracy.
- Extendibility to Other Modalities (Sec. 4.2): Alchemist can operate with modalities beyond text.
- Use of Supplementary Information (Sec. 4.3): Incorporating relevant information into prompts enables the generation of better programs, yielding more accurate pseudolabels.
- More Diverse Programs Can Help (Sec. 4.4): Increasing the diversity of generated programs created by different labeling logic enables better pseudolabels.
- Comparing to Human-crafted Programs (Sec. 4.5): Synthesized programs may be more effective than human-crafted ones.
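To make the claims above concrete, here is a minimal illustrative sketch (not the authors' code) of the core idea: LLM-generated labeling programs are small functions that each vote on a label, and their diverse outputs are aggregated into a pseudolabel. The paper aggregates with weak-supervision tooling such as Snorkel; this sketch substitutes a simple majority vote, and the three example programs are hypothetical.

```python
# Hypothetical labeling programs for spam detection (YouTube/SMS-style
# tasks); each returns a vote or abstains, and votes are aggregated by
# majority rule to produce a pseudolabel.
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_keyword(text):
    # Hypothetical program: promotional keywords suggest spam.
    return SPAM if any(w in text.lower() for w in ("free", "subscribe", "http")) else ABSTAIN

def lf_length(text):
    # Hypothetical program: very short comments are often ham.
    return HAM if len(text.split()) < 5 else ABSTAIN

def lf_caps(text):
    # Hypothetical program: all-caps shouting suggests spam.
    return SPAM if text.isupper() else ABSTAIN

def aggregate(text, programs):
    # Majority vote over non-abstaining programs; abstain if none vote.
    votes = [v for v in (p(text) for p in programs) if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

pseudolabel = aggregate("SUBSCRIBE FOR FREE PRIZES", (lf_keyword, lf_length, lf_caps))
```

The diversity claim (Sec. 4.4) corresponds to using programs with different labeling logic (keywords, length, casing) so that their errors are less correlated under aggregation.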
Researcher Affiliation | Academia | Tzu-Heng Huang, Catherine Cao, Vaishnavi Bhargava, Frederic Sala. University of Wisconsin-Madison. {thuang273, ccao35, vbhargava3}@wisc.edu, fredsala@cs.wisc.edu
Pseudocode | No | The paper shows examples of generated programs (code snippets) in Figure 1, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block for its own method.
Open Source Code | Yes | We release our code here: https://github.com/SprocketLab/Alchemist.
Open Datasets | Yes | Datasets. We include diverse datasets covering text and image modalities. For text, we include eight datasets that span three different types of language tasks. These include the YouTube [43] and SMS [44] datasets for spam classification; the IMDb [45], Yelp [45], Finance [46], and French [47] datasets for sentiment analysis; and the MedAbs [48] and Cancer [14] datasets for topic classification. ... Table 5: Dataset Table. We place more details about our datasets and experimental setups here. First, in Table 5 we show task type, prediction classes, and number of training data points in each dataset:

Dataset | Task | Classes | #Classes | #Train
YouTube [43] | spam comment detection | {spam, ham} | 2 | 1686
SMS [44] | spam text detection | {spam, ham} | 2 | 4571
Yelp [45] | restaurant review sentiment classification | {positive, negative} | 2 | 30400
IMDb [45] | movie review sentiment classification | {positive, negative} | 2 | 20000
MedAbs [48] | medical abstract topic classification | {neoplasms, digestive system diseases, nervous system diseases, cardiovascular diseases, general pathological conditions} | 5 | 10395
Cancer [14] | biomedical document topic classification | {colon cancer, lung cancer, thyroid cancer} | 3 | 5450
Finance [46] | finance news sentiment classification | {positive, neutral, negative} | 3 | 3488
French [47] | book review sentiment classification | {positive, neutral, negative} | 3 | 6953
Waterbirds [42] | bird species classification | {landbird, waterbird} | 2 | 5794
Dataset Splits | No | The paper states that it trains on training data and evaluates on test sets, but it does not detail the size or derivation of a separate validation set for its experiments, beyond an implicit mention that Alchemist *can* use one.
Hardware Specification | Yes | We use an NVIDIA A6000 GPU to run all experiments.
Software Dependencies | No | The paper mentions software components such as Python, GPT-4, Claude 3, CLIP, and Snorkel, but it does not specify exact version numbers for these dependencies.
Experiment Setup | Yes | All the distilled models use an MLP with 2 hidden layers, each comprising 32 units, using ReLU activations between layers and no normalization. We run 5 times with different random seeds and report their average performance.
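The distilled-model architecture described above is simple enough to sketch. The following is a minimal, dependency-free illustration (not the authors' implementation, which would use a deep-learning framework): two hidden layers of 32 units, ReLU activations between layers, and no normalization; the input dimension and class count are placeholders.

```python
import random

def relu(v):
    # Element-wise ReLU activation.
    return [max(0.0, x) for x in v]

def linear(v, w, b):
    # Affine layer: w is an out_dim x in_dim matrix, b an out_dim bias.
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi for row, bi in zip(w, b)]

def init_layer(in_dim, out_dim, rng):
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim

def make_distilled_mlp(in_dim, n_classes, hidden=32, seed=0):
    # Architecture from the quoted setup: in_dim -> 32 -> 32 -> n_classes,
    # ReLU between layers, no normalization layers.
    rng = random.Random(seed)
    layers = [init_layer(in_dim, hidden, rng),
              init_layer(hidden, hidden, rng),
              init_layer(hidden, n_classes, rng)]
    def forward(x):
        h = relu(linear(x, *layers[0]))
        h = relu(linear(h, *layers[1]))
        return linear(h, *layers[2])  # raw logits
    return forward

mlp = make_distilled_mlp(in_dim=8, n_classes=2)
logits = mlp([0.5] * 8)
```

Running with 5 random seeds, as the paper does, would amount to varying `seed` (and the training shuffling) and averaging the resulting test accuracies.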