Empower Sequence Labeling with Task-Aware Neural Language Model
Authors: Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Xu, Huan Gui, Jian Peng, Jiawei Han
AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on benchmark datasets demonstrate the effectiveness of leveraging character-level knowledge and the efficiency of co-training. For example, on the CoNLL03 NER task, model training completes in about 6 hours on a single GPU, reaching an F1 score of 91.71±0.10 without using any extra annotations. |
| Researcher Affiliation | Collaboration | Liyuan Liu, Jingbo Shang, Xiang Ren, Frank F. Xu, Huan Gui, Jian Peng, Jiawei Han. University of Illinois at Urbana-Champaign: {ll2, shang7, jianpeng, hanj}@illinois.edu; University of Southern California: xiangren@usc.edu; Shanghai Jiao Tong University: frankxu@sjtu.edu.cn; Facebook: huangui@fb.com |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | We implement LM-LSTM-CRF based on the PyTorch library. Models have been trained on one GeForce GTX 1080 GPU, with training time recorded in Table 8. ... https://github.com/LiyuanLucasLiu/LM-LSTM-CRF |
| Open Datasets | Yes | We conduct experiments on the CoNLL 2003 NER task, the CoNLL 2000 chunking task, as well as the WSJ portion of the Penn Treebank POS tagging task. ... The corpus statistics are summarized in Table 2. |
| Dataset Splits | Yes | CoNLL03 NER contains annotations for four entity types: PER, LOC, ORG, and MISC. It has been separated into training, development and test sets. ... CoNLL00 chunking defines eleven syntactic chunk types (e.g., NP, VP) in addition to Other. It only includes training and test sets. Following previous works (Peters et al. 2017), we sampled 1000 sentences from the training set as a held-out development set. WSJ contains 25 sections and categorizes each word into 45 POS tags. We adopt the standard split and use sections 0-18 as training data, sections 19-21 as development data, and sections 22-24 as test data (Manning 2011). (See the split sketch after the table.) |
| Hardware Specification | Yes | We implement LM-LSTM-CRF based on the PyTorch library. Models have been trained on one GeForce GTX 1080 GPU, with training time recorded in Table 8. |
| Software Dependencies | No | The paper mentions 'PyTorch library' but does not specify a version number. No other software dependencies are listed with specific versions. |
| Experiment Setup | Yes | For a fair comparison, we didn't spend much time on tuning parameters but borrow the initialization, optimization method, and all related hyper-parameter values (except the state size of LSTM) from the previous work (Ma and Hovy 2016). For the hidden state size of LSTM, we expand it from 200 to 300, because introducing additional knowledge allows us to train a larger network. We will further discuss this change later. Table 3 summarizes some important hyper-parameters. Since CoNLL00 is similar to the CoNLL03 NER dataset, we conduct experiments with the same parameters on both tasks. Initialization. We use GloVe 100-dimension pre-trained word embeddings released by Stanford and randomly initialize the other parameters (Glorot and Bengio 2010; Jozefowicz, Zaremba, and Sutskever 2015). Optimization. We employ mini-batch stochastic gradient descent with momentum. The batch size, the momentum and the learning rate are set to 10, 0.9 and η_t = η_0 / (1 + ρt), where η_0 is the initial learning rate and ρ = 0.05 is the decay ratio. Dropout is applied in our model, and its ratio is fixed to 0.5. To increase stability, we use gradient clipping of 5.0. Network Structure. The hyper-parameters of the character-level LSTM are set to the same values as the word-level biLSTM. We fix the depth of highway layers as 1 to avoid an over-complicated model. (See the optimizer sketch after the table.) |
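The dataset splits quoted in the "Dataset Splits" row can be made concrete with a small helper. The sketch below is an illustration, not the authors' preprocessing code: the function names and the random seed are assumptions, and only the WSJ section ranges and the 1000-sentence held-out development set come from the quoted text.

```python
import random

def wsj_split(section_id: int) -> str:
    """Standard WSJ POS-tagging split: sections 0-18 train, 19-21 dev, 22-24 test."""
    if 0 <= section_id <= 18:
        return "train"
    if 19 <= section_id <= 21:
        return "dev"
    if 22 <= section_id <= 24:
        return "test"
    raise ValueError("WSJ POS tagging uses sections 0-24 only")

def hold_out_conll00_dev(train_sentences, dev_size=1000, seed=0):
    """CoNLL00 chunking ships without a development set; hold out 1000 training
    sentences as described above (the seed is an assumption, not from the paper)."""
    rng = random.Random(seed)
    held_out = set(rng.sample(range(len(train_sentences)), dev_size))
    dev = [s for i, s in enumerate(train_sentences) if i in held_out]
    train = [s for i, s in enumerate(train_sentences) if i not in held_out]
    return train, dev
```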
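The optimization settings in the "Experiment Setup" row map directly onto a few lines of PyTorch, the library the paper reports using. The sketch below is only an assumed reconstruction: the model is a stand-in, `ETA_0` is a placeholder because the excerpt does not state the initial learning rate, and the function names are invented; the momentum, batch size, decay ratio, dropout ratio, and clipping threshold are the values quoted above.

```python
import torch
from torch import nn

ETA_0 = 0.01        # placeholder initial learning rate (assumption, not stated above)
RHO = 0.05          # decay ratio from the quoted setup
MOMENTUM = 0.9      # SGD momentum
BATCH_SIZE = 10     # mini-batch size
CLIP_NORM = 5.0     # gradient clipping threshold
DROPOUT = 0.5       # dropout ratio, applied inside the model (e.g. nn.Dropout(DROPOUT))

model = nn.LSTM(input_size=100, hidden_size=300)  # stand-in network; hidden size 300 as quoted
optimizer = torch.optim.SGD(model.parameters(), lr=ETA_0, momentum=MOMENTUM)

def decay_learning_rate(epoch: int) -> None:
    """Set eta_t = eta_0 / (1 + rho * t) at the start of epoch t."""
    lr = ETA_0 / (1.0 + RHO * epoch)
    for group in optimizer.param_groups:
        group["lr"] = lr

def train_step(loss: torch.Tensor) -> None:
    """One mini-batch SGD update with gradient clipping at 5.0."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
    optimizer.step()
```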