Deep contextualized word representations

Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

ICLR 2018

Reproducibility assessment. Each entry below lists the variable, the assessed result, and the supporting LLM response quoted from the paper.
Research Type: Experimental. LLM Response: Extensive experiments demonstrate that ELMo representations work extremely well in practice. We first show that they can be easily added to existing models for six diverse and challenging language understanding problems, including textual entailment, question answering and sentiment analysis. The addition of ELMo representations alone significantly improves the state of the art in every case, including up to 20% relative error reductions... Table 1 shows the performance of ELMo across a diverse set of six benchmark NLP tasks.
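As a rough illustration of how ELMo vectors are folded into an existing task model, the sketch below implements the paper's softmax-normalized, gamma-scaled weighted sum over biLM layers in PyTorch; the `ScalarMix` name, the layer shapes, and the concatenation step are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of biLM layer representations.

    Combines L+1 layer outputs with softmax-normalized scalar weights
    and a learned global scale, in the spirit of the paper's Eq. (1).
    """

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_reps):
        # layer_reps: list of (batch, seq_len, dim) tensors, one per biLM layer
        weights = torch.softmax(self.scalar_weights, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_reps))
        return self.gamma * mixed

# Hypothetical usage: concatenate the ELMo vector with the task model's
# existing context-independent token embeddings before its encoder.
# elmo_vec = scalar_mix([h0, h1, h2])
# enhanced = torch.cat([word_embeddings, elmo_vec], dim=-1)
```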
Researcher Affiliation: -1. LLM Response: Anonymous authors. Paper under double-blind review.
Pseudocode: No. LLM Response: The paper describes methods and equations (e.g., in Section 3), but does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code: No. LLM Response: Our trained models and code will be made publicly available, and we expect that ELMo will provide similar gains for many other NLP problems. (The accompanying footnote URL is anonymized: http://anonymous)
Open Datasets: Yes. LLM Response: The Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) provides approximately 550K hypothesis/premise pairs... The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) contains 100K+ crowdsourced question-answer pairs... a new state-of-the-art on the OntoNotes benchmark (Pradhan et al., 2013)... OntoNotes coreference annotations from the CoNLL 2012 shared task (Pradhan et al., 2012)... The CoNLL 2003 NER task (Sang & Meulder, 2003)... The fine-grained sentiment classification task in the Stanford Sentiment Treebank (SST-5; Socher et al., 2013).
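For readers who want to pull public copies of the same corpora, a minimal sketch using the Hugging Face `datasets` library is below; this library postdates the paper, and the hub identifiers ("snli", "squad") are assumptions about where public copies live, not the authors' data pipeline.

```python
# Illustrative only: not the authors' data loading code.
from datasets import load_dataset

snli = load_dataset("snli")    # Bowman et al. (2015), ~550K premise/hypothesis pairs
squad = load_dataset("squad")  # Rajpurkar et al. (2016), 100K+ question-answer pairs

print(snli)   # DatasetDict with train/validation/test splits
print(squad)  # DatasetDict with train/validation splits
```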
Dataset Splits: Yes. LLM Response: Once fine tuned, the biLM weights were fixed during task training. Table 7 lists the development set perplexities for the considered tasks. In every case except CoNLL 2012, fine tuning results in a large improvement in perplexity, e.g., from 72.1 to 16.8 for SNLI. Models are trained for 500 epochs or until validation F1 does not improve for 200 epochs, whichever is sooner.
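A minimal sketch of the stated stopping rule follows (at most 500 epochs, stopping once validation F1 has not improved for 200 epochs); `train_one_epoch` and `evaluate_f1` are hypothetical task-specific callables, not functions from the paper's code.

```python
def train_with_early_stopping(model, train_one_epoch, evaluate_f1,
                              max_epochs=500, patience=200):
    """Train for at most `max_epochs`, stopping early once validation F1
    has not improved for `patience` consecutive epochs."""
    best_f1, stale = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_f1 = evaluate_f1(model)
        if val_f1 > best_f1:
            best_f1, stale = val_f1, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation F1 stalled
    return best_f1
```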
Hardware Specification: No. LLM Response: The paper describes model architectures and sizes but does not provide specific details on the hardware used for training or experimentation, such as GPU/CPU models or memory.
Software Dependencies: No. LLM Response: The paper mentions optimizers like Adam and Adadelta and pretrained resources such as GloVe, but does not provide specific version numbers for any software dependencies.
Experiment Setup: Yes. LLM Response: For regularization, we added 50% variational dropout (Gal & Ghahramani, 2016) to the input of each LSTM layer and 50% dropout (Srivastava et al., 2014) at the input to the final two fully connected layers. All feed forward layers use ReLU activations. Parameters were optimized using Adam (Kingma & Ba, 2015) with gradient norms clipped at 5.0 and initial learning rate 0.0004, decreasing by half each time accuracy on the development set did not increase in subsequent epochs. The batch size was 32.
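A minimal PyTorch-style sketch of this setup: Adam at learning rate 0.0004 with gradient norms clipped at 5.0, ReLU feed-forward layers with 50% dropout, and a halve-on-plateau schedule keyed to development-set accuracy. The layer sizes and classifier head are placeholders, and plain `nn.Dropout` stands in for the variational dropout applied at the LSTM inputs.

```python
import torch
import torch.nn as nn

# Placeholder classifier head; hidden sizes are illustrative, not the paper's.
model = nn.Sequential(
    nn.Dropout(0.5),       # 50% dropout at the input to the FC layers
    nn.Linear(1024, 300),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(300, 3),
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0004)
# Halve the learning rate when development accuracy stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=1)

def training_step(batch_inputs, batch_labels, loss_fn=nn.CrossEntropyLoss()):
    # One update on a batch of 32 examples (the stated batch size).
    optimizer.zero_grad()
    loss = loss_fn(model(batch_inputs), batch_labels)
    loss.backward()
    # Clip gradient norms at 5.0, as stated in the setup.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()

# After each epoch, pass the metric to the scheduler:
# scheduler.step(dev_accuracy)
```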