Medical Synonym Extraction with Concept Space Models

Authors: Chang Wang, Liangliang Cao, Bowen Zhou

IJCAI 2015

Reproducibility assessment (each item lists the variable, the result, and the LLM response quoting the paper's supporting text):
Research Type: Experimental
    "Experiments on a dataset with more than 1M term pairs show that the proposed approach outperforms the baseline approaches by a large margin. The experimental results show that our synonym extraction models are fast and outperform the state-of-the-art approaches on medical synonym extraction by a large margin."
Researcher Affiliation: Industry
    "Chang Wang, Liangliang Cao, and Bowen Zhou. IBM T. J. Watson Research Lab, 1101 Kitchawan Rd, Yorktown Heights, New York 10598. {changwangnk, liangliang.cao}@gmail.com, zhou@us.ibm.com"
Pseudocode: No
    The paper presents mathematical derivations for the update rules (Equations 1 and 2) and defines its notation (Figure 2), but it does not include a block explicitly labeled "Pseudocode" or "Algorithm" showing step-by-step procedures.
Open Source Code: No
    The paper does not contain any explicit statement about providing open-source code for the described methodology, nor a link to a code repository.
Open Datasets: Yes
    "Our medical corpus has incorporated a set of Wikipedia articles and MEDLINE abstracts (2013 version) [http://www.nlm.nih.gov/bsd/pmresources.html]. We also complemented these sources with around 20 medical journals and books like Merck Manual of Diagnosis and Therapy. In total, the corpus contains about 130M sentences (about 20 GB of pure text), and about 15M distinct terms in the vocabulary set. The UMLS 2012 Release contains more than 2.7 million concepts from over 160 source vocabularies."
    Reference: [Lindberg et al., 1993] D. Lindberg, B. Humphreys, and A. McCray. The Unified Medical Language System. Methods of Information in Medicine, 32:281-291, 1993.
Dataset Splits: Yes
    "The final dataset was split into three parts: 60% of the examples were used for training, 20% were used for testing the classifiers, and the remaining 20% were held out to evaluate the knowledge-base construction results."
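The 60/20/20 split described above can be sketched as follows. This is a minimal illustration only; the paper does not specify the partitioning procedure, so a seeded random shuffle is assumed here:

```python
import random

def split_dataset(pairs, seed=0):
    """Split labeled term pairs into 60% train, 20% test, 20% held-out."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train = int(0.6 * n)
    n_test = int(0.2 * n)
    train = pairs[:n_train]
    test = pairs[n_train:n_train + n_test]
    heldout = pairs[n_train + n_test:]
    return train, test, heldout

# With 100 examples this yields 60 / 20 / 20 disjoint subsets.
train, test, heldout = split_dataset(range(100))
```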
Hardware Specification: Yes
    "It took on average several hours to generate the word embedding file from our medical corpus of 20 GB of text using 16 3.2 GHz CPUs, and roughly 30 minutes to finish the training process using one CPU."
Software Dependencies: No
    The paper mentions software such as the Word2Vec model, the liblinear package, and the Medical ESG parser, but it does not provide specific version numbers for these dependencies.
Experiment Setup: Yes
    "The parameters used in the experiments were: dimension size = 100, window size = 5, negative = 10, and sample rate = 1e-5. In all the experiments, the weight for the positive examples was set to 100, due to the fact that most of the input examples were negative."
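The reported embedding hyperparameters map directly onto flags of the original word2vec command-line tool, and the positive-example weight of 100 corresponds to liblinear's per-class weight option. A minimal sketch, assuming those tools; all file names here are hypothetical:

```python
# Embedding hyperparameters reported in the paper, expressed as flags
# for the original word2vec tool (corpus and output paths hypothetical):
w2v_cmd = (
    "word2vec -train medical_corpus.txt -output embeddings.bin "
    "-size 100 -window 5 -negative 10 -sample 1e-5"
)

# Weight of 100 on positive examples, expressed via liblinear's -wi
# option (weight 100 for label +1; file names hypothetical):
svm_cmd = "train -w1 100 pairs.train model.out"

print(w2v_cmd)
print(svm_cmd)
```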