End-To-End Memory Networks

Authors: Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus

NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments on the synthetic QA tasks defined in [22] (using version 1.1 of the dataset). For the former our approach is competitive with Memory Networks, but with less supervision. For the latter, on the Penn Tree Bank and Text8 datasets our approach demonstrates comparable performance to RNNs and LSTMs. In both cases we show that the key concept of multiple computational hops yields improved results.
Researcher Affiliation | Collaboration | Sainbayar Sukhbaatar, Dept. of Computer Science, Courant Institute, New York University, sainbar@cs.nyu.edu; Arthur Szlam, Jason Weston, Rob Fergus, Facebook AI Research, New York, {aszlam,jase,robfergus}@fb.com
Pseudocode | No (see the model sketch after this table) | The paper includes diagrams of the model but no pseudocode or algorithm blocks.
Open Source Code | Yes | MemN2N source code is available at https://github.com/facebook/MemNN.
Open Datasets | Yes | We perform experiments on the synthetic QA tasks defined in [22] (using version 1.1 of the dataset). Penn Treebank [13]: This consists of 929k/73k/82k train/validation/test words, distributed over a vocabulary of 10k words. The same preprocessing as [25] was used. Text8 [15]: This is a pre-processed version of the first 100M characters dumped from Wikipedia.
Dataset Splits | Yes | 10% of the bAbI training set was held out to form a validation set, which was used to select the optimal model architecture and hyperparameters. Penn Treebank [13]: This consists of 929k/73k/82k train/validation/test words... Text8 [15]: This is a pre-processed version of the first 100M characters dumped from Wikipedia. This is split into 93.3M/5.7M/1M character train/validation/test sets.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes (training schedules sketched after this table) | Our models were trained using a learning rate of η = 0.01, with anneals every 25 epochs by η/2 until 100 epochs were reached. No momentum or weight decay was used. The weights were initialized randomly from a Gaussian distribution with zero mean and σ = 0.1. All training uses a batch size of 32... gradients with an ℓ2 norm larger than 40 are divided by a scalar to have norm 40. In LS training, the initial learning rate is set to η = 0.005. For each mini-batch update, the ℓ2 norm of the whole gradient of all parameters is measured and if larger than L = 50, then it is scaled down to have norm L. We use the learning rate annealing schedule from [15], namely, if the validation cost has not decreased after one epoch, then the learning rate is scaled down by a factor 1.5. Training terminates when the learning rate drops below 10⁻⁵, i.e. after 50 epochs or so. Weights are initialized using N(0, 0.05) and batch size is set to 128.
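
The paper describes the model through diagrams and equations rather than pseudocode, so a minimal sketch of the multi-hop memory mechanism may help. It is written in NumPy with bag-of-words sentence vectors and embeddings shared across hops (roughly the paper's layer-wise tying, omitting the extra linear map H), and every name below is illustrative rather than taken from the released MemNN code.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def memn2n_answer(story_bow, question_bow, A, B, C, W, hops=3):
        """One forward pass of a simplified multi-hop memory network.

        story_bow    : (num_sentences, vocab) bag-of-words sentence vectors
        question_bow : (vocab,) bag-of-words question vector
        A, C         : (vocab, d) input / output memory embeddings, shared across hops
        B            : (vocab, d) question embedding
        W            : (d, vocab) final projection onto the answer vocabulary
        """
        u = question_bow @ B              # internal state u = B q
        m = story_bow @ A                 # input memories  m_i = A x_i
        c = story_bow @ C                 # output memories c_i = C x_i
        for _ in range(hops):
            p = softmax(m @ u)            # attention over memories: p_i = softmax(u^T m_i)
            o = p @ c                     # response vector: o = sum_i p_i c_i
            u = u + o                     # hop update: u <- u + o
        return softmax(u @ W)             # answer distribution: softmax(W (u + o))

    # toy usage with random inputs and parameters
    rng = np.random.default_rng(0)
    V, d, n = 50, 20, 6
    probs = memn2n_answer(rng.random((n, V)), rng.random(V),
                          rng.normal(0, 0.1, (V, d)), rng.normal(0, 0.1, (V, d)),
                          rng.normal(0, 0.1, (V, d)), rng.normal(0, 0.1, (d, V)))

The paper varies the number of hops (1 to 3) in the QA experiments and reports that multiple computational hops improve results, which is what the loop above implements.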
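
The first half of the Experiment Setup excerpt (the bAbI QA schedule) maps onto a plain SGD loop with step annealing and gradient-norm rescaling. A minimal sketch, assuming PyTorch (the paper does not name a framework) and a placeholder model and data:

    import torch
    from torch import nn

    torch.manual_seed(0)
    model = nn.Linear(20, 20)                    # placeholder for the MemN2N model
    for p in model.parameters():
        nn.init.normal_(p, mean=0.0, std=0.1)    # Gaussian init, zero mean, sigma = 0.1

    opt = torch.optim.SGD(model.parameters(), lr=0.01)   # no momentum, no weight decay
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=25, gamma=0.5)  # halve eta every 25 epochs
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(100):                     # train until 100 epochs are reached
        for _ in range(10):                      # placeholder mini-batches of size 32
            x, y = torch.randn(32, 20), torch.randint(0, 20, (32,))
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            # rescale gradients whose l2 norm exceeds 40 down to norm 40
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=40.0)
            opt.step()
        sched.step()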
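
The second half of the excerpt (the linear start, or LS, schedule used for language modelling) instead anneals only when the validation cost stops decreasing and terminates once the learning rate falls below 10⁻⁵. A sketch under the same assumptions (PyTorch, placeholder model and data; N(0, 0.05) is read here as a standard deviation of 0.05):

    import torch
    from torch import nn

    torch.manual_seed(0)
    model = nn.Linear(20, 20)                    # placeholder for the memory-network LM
    for p in model.parameters():
        nn.init.normal_(p, mean=0.0, std=0.05)   # weights ~ N(0, 0.05), std assumed

    opt = torch.optim.SGD(model.parameters(), lr=0.005)
    # scale eta down by a factor 1.5 whenever the validation cost fails to decrease
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=1 / 1.5, patience=0)
    loss_fn = nn.CrossEntropyLoss()

    while opt.param_groups[0]['lr'] >= 1e-5:     # terminate once eta drops below 1e-5
        for _ in range(10):                      # placeholder mini-batches of size 128
            x, y = torch.randn(128, 20), torch.randint(0, 20, (128,))
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=50.0)  # whole-gradient norm capped at L = 50
            opt.step()
        val_x, val_y = torch.randn(128, 20), torch.randint(0, 20, (128,))
        sched.step(loss_fn(model(val_x), val_y).item())  # anneal on the validation cost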