Reasoning with Memory Augmented Neural Networks for Language Comprehension

Authors: Tsendsuren Munkhdalai, Hong Yu

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We applied the proposed approach to the language comprehension task by using Neural Semantic Encoders (NSE). Our NSE models achieved state-of-the-art results, showing an absolute improvement of 1.2% to 2.6% accuracy over previous results obtained by single and ensemble systems on standard machine comprehension benchmarks such as the Children's Book Test (CBT) and Who-Did-What (WDW) news article datasets.
Researcher Affiliation | Academia | Tsendsuren Munkhdalai & Hong Yu, University of Massachusetts Medical School; Bedford VAMC
Pseudocode | No | The paper describes the proposed approach in detail using mathematical equations and textual descriptions, but it does not provide any pseudocode or algorithm blocks.
Open Source Code | Yes | More detail on hyperparameters can be found in our code: https://bitbucket.org/tsendeemts/nse-rc
Open Datasets | Yes | We evaluated our models on two large-scale datasets: Children's Book Test (CBT) (Hill et al., 2015) and Who-Did-What (WDW) (Onishi et al., 2016).
Dataset Splits | Yes | Table 1: Statistics of the datasets. train (s): train strict, train (r): train relaxed and cands: candidates. The table reports train/dev/test splits for WDW (strict and relaxed training sets) and for CBT-NE and CBT-CN.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper mentions using 'a pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014)' but does not specify version numbers for any other software dependencies, libraries, or frameworks used.
Experiment Setup | Yes | We used stochastic gradient descent with an Adam optimizer to train the models. The initial learning rate (lr) was set to 0.0005 for CBT-CN and to 0.001 for the other tasks. Pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) were used to initialize the word embedding layer; the embedding layer size is therefore 300. The hidden layer size of the context-embedding BiLSTM nets is k = 436. The embeddings for out-of-vocabulary words and the model parameters were randomly initialized from the uniform distribution over [-0.1, 0.1). The gradient clipping threshold was set to 15. The models were regularized by applying 20% dropout to the embedding layer. We used a batch size of n = 32 for the CBT dataset and n = 25 for the WDW dataset, and early stopping with a patience of 1 epoch.
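
To make the reported setup concrete, below is a minimal sketch of how these hyperparameters could be wired together, written in PyTorch as an assumption; the framework choice, the class and function names (ContextEncoder, train_step), and the vocabulary size are illustrative placeholders and not the authors' implementation (their code is at the Bitbucket link above).

    # Minimal sketch of the reported training configuration (assumed PyTorch).
    import torch
    import torch.nn as nn

    EMBED_DIM = 300        # pre-trained 300-D GloVe 840B vectors
    HIDDEN_DIM = 436       # context-embedding BiLSTM hidden size k
    DROPOUT = 0.20         # 20% dropout applied to the embedding layer
    GRAD_CLIP = 15.0       # gradient clipping threshold
    LR = {"cbt_cn": 0.0005, "cbt_ne": 0.001, "wdw": 0.001}   # per-task learning rates
    BATCH = {"cbt_cn": 32, "cbt_ne": 32, "wdw": 25}          # per-dataset batch sizes

    class ContextEncoder(nn.Module):
        """Embedding layer + BiLSTM context encoder (stand-in for the full NSE model)."""
        def __init__(self, vocab_size, glove_weights=None):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, EMBED_DIM)
            if glove_weights is not None:
                self.embed.weight.data.copy_(glove_weights)   # pre-trained GloVe init
            else:
                # OOV embeddings / other parameters: uniform over [-0.1, 0.1)
                nn.init.uniform_(self.embed.weight, -0.1, 0.1)
            self.drop = nn.Dropout(DROPOUT)
            self.bilstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM,
                                  bidirectional=True, batch_first=True)

        def forward(self, token_ids):
            # Returns the BiLSTM outputs over the embedded, dropout-regularized tokens.
            return self.bilstm(self.drop(self.embed(token_ids)))[0]

    def train_step(model, optimizer, loss):
        """One Adam update with the reported gradient clipping."""
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        optimizer.step()

    # Vocabulary size is illustrative; early stopping with a patience of 1 epoch
    # would be handled in the outer training loop.
    model = ContextEncoder(vocab_size=50000)
    optimizer = torch.optim.Adam(model.parameters(), lr=LR["wdw"])

The per-task dictionaries simply mirror the paper's statement that CBT-CN uses a smaller learning rate (0.0005) and a batch size of 32, while WDW uses 0.001 and a batch size of 25.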