EZLearn: Exploiting Organic Supervision in Automated Data Annotation

Authors: Maxim Grechkin, Hoifung Poon, Bill Howe

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To investigate the effectiveness and generality of EZLearn, we applied it to two important applications: functional genomics and scientific figure comprehension, which differ substantially in sample input dimension and description length. In functional genomics, there are thousands of relevant classes. In scientific figure comprehension, prior work only considers three coarse classes, which we expand to twenty-four. In both scenarios, EZLearn successfully learned an accurate classifier with zero manually labeled examples. While standard co-training has labeled examples from the beginning, EZLearn can only rely on distant supervision, which is inherently noisy. We investigate several ways to reconcile distant supervision with the trained classifier's predictions during co-training. We found that it generally helps to remember distant supervision while leaving room for correction, especially by accounting for the hierarchical relations among classes. We also conducted experiments to evaluate the impact of noise on EZLearn.
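The loop the quote describes (seed labels from noisy "organic" supervision, then iteratively retrain and relabel) can be sketched in miniature. This is a toy illustration, not the paper's implementation: the real system alternates a deep main classifier with a fastText auxiliary text classifier, whereas this sketch collapses both into a single nearest-centroid stand-in, and the `lexicon`, class names, and helper functions are all invented for the example.

```python
import numpy as np

def lexicon_labels(texts, lexicon):
    """Initial noisy label sets: match ontology terms against free-text descriptions.
    This stands in for the paper's 'organic supervision'; the lexicon is hypothetical."""
    return [{c for c, terms in lexicon.items() if any(t in txt for t in terms)}
            for txt in texts]

def train_centroids(X, label_sets, classes):
    """Stand-in for Train_main: one centroid per class, averaged over samples
    currently carrying that class label (the paper uses a DAE + softmax instead)."""
    return {c: X[[c in s for s in label_sets]].mean(axis=0)
            for c in classes if any(c in s for s in label_sets)}

def predict_scores(X, centroids):
    """Class scores as a softmax over negative distances to the centroids."""
    classes = list(centroids)
    d = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes], axis=1)
    p = np.exp(-d)
    p /= p.sum(axis=1, keepdims=True)
    return classes, p

def ezlearn_toy(X, texts, lexicon, iters=5, thresh=0.3):
    label_sets = lexicon_labels(texts, lexicon)      # distant supervision seed
    classes = set().union(*label_sets)
    for _ in range(iters):
        centroids = train_centroids(X, label_sets, classes)
        cls, p = predict_scores(X, centroids)
        # Relabel: keep every class whose score crosses the threshold,
        # letting the classifier correct or fill in the noisy seed labels.
        label_sets = [{cls[j] for j in range(len(cls)) if p[i, j] >= thresh}
                      for i in range(len(X))]
    return centroids, label_sets
```

Note how a sample with no usable text (the second one below) still ends up labeled once the classifier generalizes from samples that did match the lexicon; that propagation is the point of the co-training loop.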
Researcher Affiliation | Collaboration | Maxim Grechkin (1), Hoifung Poon (2), Bill Howe (1); (1) University of Washington, (2) Microsoft Research. Emails: grechkin@uw.edu, hoifung@microsoft.com, billhowe@uw.edu
Pseudocode | Yes | Algorithm 1 EZLearn
Open Source Code | No | The paper does not include a statement about releasing the source code or provide a link to a code repository for its methodology.
Open Datasets | Yes | We used the standard BRENDA Tissue Ontology [Gremse et al., 2011], which contains 4931 human tissue types. For gene expression data, we used the Gene Expression Omnibus (GEO) [Edgar et al., 2002], a popular repository run by the National Center for Biotechnology Information.
Dataset Splits | No | The paper does not explicitly describe a validation dataset split. It mentions hyperparameter tuning ('The performance of EZLearn was not sensitive to this parameter: values in (0.2, 0.6) yielded similar results.') but does not specify a distinct validation set or its size/proportion.
Hardware Specification | No | The paper does not specify the hardware used for experiments, such as particular CPU or GPU models, memory, or cloud computing instances.
Software Dependencies | No | The paper mentions software like Keras [Chollet, 2015] and fastText [Joulin et al., 2017] but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Main classifier: We implemented Train_main using a deep denoising auto-encoder (DAE) with three Leaky ReLU layers to convert the gene expression profile to a 128-dimensional vector [Vincent et al., 2008], followed by multinomial logistic regression, trained end-to-end in Keras [Chollet, 2015], using L2 regularization with weight 1e-4 and the RMSProp optimizer [Tieleman and Hinton, 2012]. Auxiliary classifier: We implemented Train_aux using fastText with its recommended parameters (25 epochs and a starting learning rate of 1.0) [Joulin et al., 2017]. In all iterations, a labeled set might contain more than one class for a sample, which is not a problem for the learning algorithm and is useful when there is uncertainty about the correct class. EZLearn generates the labeled set by adding all (sample, class) pairs for which the score crosses a hyperparameter threshold. We used 0.3 in this paper, which allows up to 3 classes to be assigned to a sample. [...] In practice, the algorithm converges quickly [Nigam and Ghani, 2000], and we simply ran all experiments with five iterations.
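The claim that a 0.3 threshold "allows up to 3 classes" follows from the scores being a probability distribution: since the scores sum to 1, at most floor(1 / 0.3) = 3 of them can each reach 0.3. A small sketch of the thresholding step (the class names and score values are hypothetical, chosen only to illustrate the bound):

```python
import numpy as np

def label_set(scores, classes, thresh=0.3):
    """Keep every (sample, class) pair whose score crosses the threshold."""
    return {c for c, s in zip(classes, scores) if s >= thresh}

# Hypothetical softmax output over four tissue classes:
probs = np.array([0.45, 0.35, 0.15, 0.05])
labels = label_set(probs, ["liver", "hepatocyte", "brain", "kidney"])
# Two classes qualify here; no distribution over these scores can ever
# yield more than three, because 4 * 0.3 > 1.
```

Multi-label output at this step is deliberate: when the text description is ambiguous between related tissue types, keeping several candidate classes defers the decision to later co-training iterations rather than committing to a single noisy guess.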