Data Programming: Creating Large Training Sets, Quickly

Authors: Alexander J. Ratner, Christopher M. De Sa, Sen Wu, Daniel Selsam, Christopher Ré

NeurIPS 2016

Reproducibility Assessment
Each entry gives the reproducibility variable, the assessed result, and the LLM response (quoted paper evidence or the assessor's explanation).

Research Type: Experimental
    "Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition)."

Researcher Affiliation: Academia
    "Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré, Stanford University, {ajratner,cdesa,senwu,dselsam,chrismre}@stanford.edu"

Pseudocode: No
    The paper does not contain any clearly labeled pseudocode or algorithm blocks.

Open Source Code: Yes
    "To test this, we arranged a hackathon involving a handful of bioinformatics researchers, using our open-source information extraction framework Snorkel (formerly DDLite; snorkel.stanford.edu)."

Open Datasets: Yes
    "We examine a news application from the 2014 TAC-KBP Slot Filling challenge (http://www.nist.gov/tac/2014/KBP/), where we extract relations between real-world entities from articles [2]; a clinical genomics application, where we extract causal relations between genetic mutations and phenotypes from the scientific literature (https://github.com/HazyResearch/dd-genomics); and a pharmacogenomics application where we extract interactions between genes, also from the scientific literature [21]; further details are included in the Appendix."

Dataset Splits: No
    The paper states "For all experiments, we evaluated on a blind hand-labeled evaluation set" but does not provide specific details on training/validation splits, their proportions, or methods such as cross-validation.

Hardware Specification: No
    The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used to run the experiments.

Software Dependencies: No
    The paper mentions software such as LSTM models and Snorkel (formerly DDLite) but does not specify version numbers for any software dependencies, libraries, or frameworks.

Experiment Setup: No
    The paper describes the general approach and feature generation methods but does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings.
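The paper's core primitive, referenced by the evidence quotes above, is the labeling function: a user-written heuristic that votes on a candidate's label or abstains. The sketch below is purely illustrative, with all names and heuristics invented here; the paper itself denoises labeling-function votes with a learned generative model, whereas this sketch uses a simple majority vote for clarity.

```python
# Illustrative sketch of the labeling-function paradigm (not the paper's
# actual code). Each labeling function votes POSITIVE, NEGATIVE, or ABSTAIN
# on a candidate relation mention.

ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1

def lf_contains_causes(candidate):
    """Vote POSITIVE if the sentence asserts a causal link (toy heuristic)."""
    return POSITIVE if "causes" in candidate["sentence"] else ABSTAIN

def lf_explicit_negation(candidate):
    """Vote NEGATIVE on an explicit negation pattern (toy heuristic)."""
    return NEGATIVE if "does not cause" in candidate["sentence"] else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_causes, lf_explicit_negation]

def majority_vote(candidate):
    """Combine labeling-function votes; ties or no votes yield ABSTAIN."""
    score = sum(lf(candidate) for lf in LABELING_FUNCTIONS)  # ABSTAIN adds 0
    if score > 0:
        return POSITIVE
    if score < 0:
        return NEGATIVE
    return ABSTAIN

candidates = [
    {"sentence": "Mutation X causes phenotype Y."},
    {"sentence": "Gene A does not cause disease B."},
]
labels = [majority_vote(c) for c in candidates]
```

In the paper's formulation, the resulting noisy labels are not combined by majority vote but modeled jointly, estimating each labeling function's accuracy and using the inferred probabilistic labels to train a discriminative model (e.g., the LSTM mentioned above).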