Compositional Morphology for Word Representations and Language Modelling

Authors: Jan Botha, Phil Blunsom

ICML 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents a scalable method for integrating compositional morphological representations into a vector-based probabilistic language model. Our approach is evaluated in the context of log-bilinear language models, rendered suitably efficient for implementation inside a machine translation decoder by factoring the vocabulary. We perform both intrinsic and extrinsic evaluations, presenting results on a range of languages which demonstrate that our model learns morphological representations that both perform well on word similarity tasks and lead to substantial reductions in perplexity. (See the composition sketch after this table.)
Researcher Affiliation | Academia | Jan A. Botha (JAN.BOTHA@CS.OX.AC.UK), Phil Blunsom (PHIL.BLUNSOM@CS.OX.AC.UK), Department of Computer Science, University of Oxford, Oxford, OX1 3QD, UK
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our source code for language model training and integration into cdec is available from http://bothameister.github.io
Open Datasets | Yes | We make use of data from the 2013 ACL Workshop on Machine Translation. We first describe data used for translation experiments, since the monolingual datasets used for language model training were derived from that. The language pairs are English-{German, French, Spanish, Russian} and English-Czech. Our parallel data comprised the Europarl-v7 and news-commentary corpora, except for English-Russian, where we used news-commentary and the Yandex parallel corpus.
Dataset Splits | Yes | newstest2011 was used as development data for tuning language model hyperparameters, while intrinsic LM evaluation was done on newstest2012. ... For Russian, some training data was held out for tuning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software such as SRILM, KenLM, and cdec but does not provide specific version numbers for these or any other ancillary software components.
Experiment Setup | Yes | L = 10k–40k, ξ = 0.05–0.08, dependent on |V| and data size. ... Bias terms b (resp. t) are initialised to the log unigram probabilities of words (resp. classes) in the training corpus, with Laplace smoothing, while all other parameters are initialised randomly according to sharp, zero-mean Gaussians. ... Optimisation is performed by stochastic gradient descent with updates after each mini-batch of L training examples. We apply AdaGrad (Duchi et al., 2011) and tune the stepsize ξ on development data. We halt training once the perplexity on the development data starts to increase. (See the training-loop sketch after this table.)
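
The abstract quoted under Research Type describes word representations composed from morphemes inside a log-bilinear language model. Below is a minimal sketch of that compositional idea, not the authors' implementation (their C++ code is linked above): a word's vector is taken as the sum of its surface-form vector and its morpheme vectors. The segmentations, vocabulary, and dimensionality are hypothetical placeholders; the paper obtains segmentations with an unsupervised segmenter (Morfessor), whereas here they are hard-coded for illustration.

# Minimal sketch of additive morpheme composition (hypothetical names and data).
import numpy as np

rng = np.random.default_rng(0)
dim = 100  # embedding dimensionality (assumed value)

# Hypothetical segmentations: each word maps to its surface form plus morphemes.
segmentations = {
    "imperfection": ["imperfection", "im", "perfect", "ion"],
    "perfectly":    ["perfectly", "perfect", "ly"],
}

# One vector per factor (surface form or morpheme), randomly initialised.
factors = {f for fs in segmentations.values() for f in fs}
factor_vecs = {f: rng.normal(scale=0.1, size=dim) for f in factors}

def word_vector(word: str) -> np.ndarray:
    """Compose a word vector by summing the vectors of its factors."""
    return np.sum([factor_vecs[f] for f in segmentations[word]], axis=0)

print(word_vector("imperfection").shape)  # (100,)

Because "perfect" is shared across both segmentations, the two composed word vectors share a component, which is the mechanism by which morphologically related rare words can inherit information from more frequent ones.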
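The Experiment Setup row describes bias initialisation from Laplace-smoothed log unigram probabilities, mini-batch SGD with AdaGrad, and early stopping when development perplexity rises. The sketch below illustrates that regime under stated assumptions: grad_fn, dev_perplexity_fn, batches, and the stepsize value are hypothetical stand-ins, and checking the stopping criterion once per pass over the data is an assumption, not something the paper specifies.

# Sketch of the training regime quoted above (hypothetical model and data hooks).
import numpy as np

def init_bias(counts: np.ndarray) -> np.ndarray:
    """Log unigram probabilities with add-one (Laplace) smoothing."""
    smoothed = counts + 1.0
    return np.log(smoothed / smoothed.sum())

def adagrad_train(params, grad_fn, dev_perplexity_fn, batches, stepsize=0.05, eps=1e-8):
    """Mini-batch SGD with AdaGrad; halt once development perplexity increases."""
    accum = {k: np.zeros_like(v) for k, v in params.items()}
    best_ppl = float("inf")
    while True:
        for batch in batches():              # one pass over mini-batches of L examples
            grads = grad_fn(params, batch)   # gradients keyed like params
            for k, g in grads.items():
                accum[k] += g * g            # accumulate squared gradients per parameter
                params[k] -= stepsize * g / (np.sqrt(accum[k]) + eps)
        ppl = dev_perplexity_fn(params)
        if ppl >= best_ppl:                  # dev perplexity stopped improving
            break
        best_ppl = ppl
    return params

AdaGrad's per-parameter accumulation of squared gradients is what lets a single tuned stepsize ξ serve parameters with very different update frequencies, which matters here because morpheme vectors are updated far more often than rare-word vectors.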