Deep contextualized word representations

Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

ICLR 2018

Reproducibility assessment. Each entry below lists the variable, the assessed result, and the supporting LLM response quoted from the paper.
Research Type: Experimental. LLM Response: Extensive experiments demonstrate that ELMo representations work extremely well in practice. We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis. The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions... Table 1 shows the performance of ELMo across a diverse set of six benchmark NLP tasks.
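As a rough illustration of how ELMo vectors are folded into an existing task model, the sketch below implements the paper's softmax-normalized, gamma-scaled weighted sum over biLM layers in PyTorch; the `ScalarMix` name, the layer shapes, and the concatenation step are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of biLM layer representations.

    Combines L+1 layer outputs with softmax-normalized scalar weights
    and a learned global scale, in the spirit of the paper's Eq. (1).
    """

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_reps):
        # layer_reps: list of (batch, seq_len, dim) tensors, one per biLM layer
        weights = torch.softmax(self.scalar_weights, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_reps))
        return self.gamma * mixed

# Hypothetical usage: concatenate the ELMo vector with the task model's
# existing context-independent token embeddings before its encoder.
# elmo_vec = scalar_mix([h0, h1, h2])
# enhanced = torch.cat([word_embeddings, elmo_vec], dim=-1)
```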
Researcher Affiliation: -1. LLM Response: Anonymous authors. Paper under double-blind review.
Pseudocode: No. LLM Response: The paper describes methods and equations (e.g., in Section 3), but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code: No. LLM Response: Our trained models and code will be made publicly available, and we expect that ELMo will provide similar gains for many other NLP problems. (The accompanying footnote URL is anonymized: http://anonymous)
Open Datasets: Yes. LLM Response: The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) provides approximately 550K hypothesis/premise pairs... The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains 100K+ crowdsourced question-answer pairs... a new state-of-the-art on the OntoNotes benchmark (Pradhan et al., 2013)... OntoNotes coreference annotations from the CoNLL 2012 shared task (Pradhan et al., 2012)... The CoNLL 2003 NER task (Sang & Meulder, 2003)... The fine-grained sentiment classification task in the Stanford Sentiment Treebank (SST-5; Socher et al., 2013).
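For readers who want to pull public copies of the same corpora, a minimal sketch using the Hugging Face `datasets` library is below; this library postdates the paper, and the hub identifiers ("snli", "squad") are assumptions about where public copies live, not the authors' data pipeline.

```python
# Illustrative only: not the authors' data loading code.
from datasets import load_dataset

snli = load_dataset("snli")    # Bowman et al. (2015), ~550K premise/hypothesis pairs
squad = load_dataset("squad")  # Rajpurkar et al. (2016), 100K+ question-answer pairs

print(snli)   # DatasetDict with train/validation/test splits
print(squad)  # DatasetDict with train/validation splits
```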
Dataset Splits: Yes. LLM Response: Once fine tuned, the biLM weights were fixed during task training. Table 7 lists the development set perplexities for the considered tasks. In every case except CoNLL 2012, fine tuning results in a large improvement in perplexity, e.g., from 72.1 to 16.8 for SNLI. Models are trained for 500 epochs or until validation F1 does not improve for 200 epochs, whichever is sooner.
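A minimal sketch of the stated stopping rule follows (at most 500 epochs, stopping once validation F1 has not improved for 200 epochs); `train_one_epoch` and `evaluate_f1` are hypothetical task-specific callables, not functions from the paper's code.

```python
def train_with_early_stopping(model, train_one_epoch, evaluate_f1,
                              max_epochs=500, patience=200):
    """Train for at most `max_epochs`, stopping early once validation F1
    has not improved for `patience` consecutive epochs."""
    best_f1, stale = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_f1 = evaluate_f1(model)
        if val_f1 > best_f1:
            best_f1, stale = val_f1, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation F1 stalled
    return best_f1
```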
Hardware Specification: No. LLM Response: The paper describes model architectures and sizes but does not provide specific details on the hardware used for training or experimentation, such as GPU/CPU models or memory.
Software Dependencies: No. LLM Response: The paper mentions optimizers like Adam and Adadelta and pretrained resources such as GloVe, but does not provide specific version numbers for any software dependencies.
Experiment Setup: Yes. LLM Response: For regularization, we added 50% variational dropout (Gal & Ghahramani, 2016) to the input of each LSTM layer and 50% dropout (Srivastava et al., 2014) at the input to the final two fully connected layers. All feed forward layers use ReLU activations. Parameters were optimized using Adam (Kingma & Ba, 2015) with gradient norms clipped at 5.0 and initial learning rate 0.0004, decreasing by half each time accuracy on the development set did not increase in subsequent epochs. The batch size was 32.
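A minimal PyTorch-style sketch of this setup: Adam at learning rate 0.0004 with gradient norms clipped at 5.0, ReLU feed-forward layers with 50% dropout, and a halve-on-plateau schedule keyed to development-set accuracy. The layer sizes and classifier head are placeholders, and plain `nn.Dropout` stands in for the variational dropout applied at the LSTM inputs.

```python
import torch
import torch.nn as nn

# Placeholder classifier head; hidden sizes are illustrative, not the paper's.
model = nn.Sequential(
    nn.Dropout(0.5),       # 50% dropout at the input to the FC layers
    nn.Linear(1024, 300),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(300, 3),
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)
# Halve the learning rate when development accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=1)

def training_step(batch_inputs, batch_labels, loss_fn=nn.CrossEntropyLoss()):
    # One update on a batch of 32 examples (the stated batch size).
    optimizer.zero_grad()
    loss = loss_fn(model(batch_inputs), batch_labels)
    loss.backward()
    # Clip gradient norms at 5.0, as stated in the setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()

# After each epoch, pass the metric to the scheduler:
# scheduler.step(dev_accuracy)
```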