The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators
Authors: Tzu-Heng Huang, Catherine Cao, Vaishnavi Bhargava, Frederic Sala
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the capability of Alchemist empirically. Our goals are to validate the following claims: Cost Reduction and Improved Performance (Sec. 4.1): Alchemist can reduce cost by orders of magnitude, while producing labels of similar or better accuracy. Extendibility to Other Modalities (Sec. 4.2): Alchemist can operate with modalities beyond text. Use of Supplementary Information (Sec. 4.3): Incorporating relevant information into prompts enables the generation of better programs, yielding more accurate pseudolabels. More Diverse Programs Can Help (Sec. 4.4): Increasing the diversity of generated programs created with different labeling logic enables better pseudolabels. Comparing to Human-Crafted Programs (Sec. 4.5): Synthesized programs may be more effective than human-crafted ones. |
| Researcher Affiliation | Academia | Tzu-Heng Huang, Catherine Cao, Vaishnavi Bhargava, Frederic Sala University of Wisconsin-Madison {thuang273, ccao35, vbhargava3}@wisc.edu, fredsala@cs.wisc.edu |
| Pseudocode | No | The paper shows examples of generated programs (code snippets) in Figure 1, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block for its own method. |
| Open Source Code | Yes | We release our code here: https://github.com/SprocketLab/Alchemist. |
| Open Datasets | Yes | Datasets. We include diverse datasets covering text and image modalities. For text, we include eight datasets that span three different types of language tasks. These include the YouTube [43] and SMS [44] datasets for spam classification, the IMDb [45], Yelp [45], Finance [46], and French [47] datasets for sentiment analysis, and the MedAbs [48] and Cancer [14] datasets for topic classification. ... Table 5: Dataset Table. We place more details about our datasets and experimental setups here. First, in Table 5 we show task type, prediction classes, and number of training data points in each dataset. ... YouTube [43], spam comment detection, {spam, ham}, 2 classes, 1686 examples; SMS [44], spam text detection, {spam, ham}, 2 classes, 4571 examples; Yelp [45], restaurant review sentiment classification, {positive, negative}, 2 classes, 30400 examples; IMDb [45], movie review sentiment classification, {positive, negative}, 2 classes, 20000 examples; MedAbs [48], medical abstract topic classification, {neoplasms, digestive system diseases, nervous system diseases, cardiovascular diseases, general pathological conditions}, 5 classes, 10395 examples; Cancer [14], biomedical document topic classification, {colon cancer, lung cancer, thyroid cancer}, 3 classes, 5450 examples; Finance [46], finance news sentiment classification, {positive, neutral, negative}, 3 classes, 3488 examples; French [47], book review sentiment classification, {positive, neutral, negative}, 3 classes, 6953 examples; Waterbirds [42], bird species classification, {landbird, waterbird}, 2 classes, 5794 examples. |
| Dataset Splits | No | The paper states it uses training data and evaluates on testing datasets but does not explicitly detail the size or percentage of a separate validation set or how it was derived for its experiments beyond an implicit mention that Alchemist *can* use one. |
| Hardware Specification | Yes | We use an NVIDIA A6000 GPU to run all experiments. |
| Software Dependencies | No | The paper mentions software components like Python, GPT-4, Claude 3, CLIP, and Snorkel, but it does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | All distilled models use an MLP with 2 hidden layers, each comprising 32 units, with ReLU activations between layers and no normalization. We run 5 times with different random seeds and report their average performance. |
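The distilled-model architecture quoted above (an MLP with two hidden layers of 32 units, ReLU activations, and no normalization) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the input dimension, weight initialization, and function names here are assumptions for demonstration, and training/evaluation over the 5 seeds is omitted.

```python
import numpy as np

def build_mlp(in_dim, n_classes, hidden=32, seed=0):
    """Initialize weights for an MLP with two hidden layers of `hidden` units.

    Layer sizes follow the paper's description: in_dim -> 32 -> 32 -> n_classes.
    He-style initialization is an assumption, not stated in the paper.
    """
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, n_classes]
    return [
        (rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def forward(params, x):
    """Forward pass: ReLU between layers, no normalization, raw logits out."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:  # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return x

# The paper reports averages over 5 random seeds; one would train and
# evaluate a model per seed, e.g.:
#   for seed in range(5):
#       params = build_mlp(in_dim=768, n_classes=2, seed=seed)
#       ...train on pseudolabels, evaluate, collect accuracy...
```

A usage example: for a binary task (e.g., spam vs. ham) with hypothetical 768-dimensional text embeddings as input, `build_mlp(768, 2)` produces three weight matrices of shapes (768, 32), (32, 32), and (32, 2).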