The Neural Noisy Channel
Authors: Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, Tomas Kocisky
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on abstractive sentence summarisation, morphological inflection, and machine translation show that noisy channel models outperform direct models, and that they significantly benefit from increased amounts of unpaired output data that direct models cannot easily use. |
| Researcher Affiliation | Collaboration | Lei Yu1, Phil Blunsom1,2, Chris Dyer2, Edward Grefenstette2, and Tomáš Kočiský1,2; 1University of Oxford and 2DeepMind |
| Pseudocode | Yes | Algorithm 1 Noisy Channel Decoding (a decoding sketch follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | The dataset (Rush et al., 2015) that we use is constructed by pairing the first sentence and the headline of each article from the annotated Gigaword corpus (Graff et al., 2003; Napoles et al., 2012). [...] We used parallel data with 184k sentence pairs (from the FBIS corpus, LDC2003E14) and monolingual data with 4.3 million of English sentences (selected from the English Gigaword). [...] The dataset (Durrett & De Nero, 2013) that we use in the experiments is created from Wiktionary [...] Our language models were trained on word types extracted by running a morphological analysis tool on the WMT 2016 monolingual data |
| Dataset Splits | Yes | There are 3.8m, 190k and 381k sentence pairs in the training, validation and test sets, respectively. [...] The train/dev/test split for German nouns is 2364/200/200, and for German verbs is 1617/200/200. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers (Adam) and network architectures (LSTMs) but does not provide specific version numbers for software libraries or frameworks used (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | The loss (Equation 2) is optimized by Adam (Kingma & Ba, 2015), with initial learning rate of 0.001. We use LSTMs with 1 layer for both the encoder and decoders, with hidden units of 256. The mini-batch size is 32, and dropout of 0.2 is applied to the input and output of LSTMs. For the language model, we use a 2-layer LSTM with 1024 hidden units and 0.5 dropout. The learning rate is 0.0001. All the hyperparameters are optimised via grid search on the perplexity of the validation set. During decoding, beam search is employed with the number of proposals generated by the direct model K1 = 20, and the number of best candidates selected by the noisy channel model K2 = 10. (These values are collected in the configuration sketch after the table.) |
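
The Pseudocode row cites Algorithm 1 (Noisy Channel Decoding). Below is a minimal sketch of the two-stage beam search it describes: the direct model proposes K1 extensions per partial hypothesis, and the noisy channel objective (channel model plus language model) keeps the K2 best. The model interfaces (`direct_model.topk_extensions`, `channel_model.score`, `language_model.score`) are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch of noisy channel beam-search decoding.
# Model interfaces are assumed placeholders, not the paper's API.

def noisy_channel_decode(x, direct_model, channel_model, language_model,
                         k1=20, k2=10, max_len=50, eos="</s>"):
    """Rescore direct-model proposals with the channel model + language model."""
    beam = [([], 0.0)]          # (partial output y, noisy channel score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for y, _ in beam:
            # 1) The direct model q(y|x) proposes K1 next tokens.
            for token in direct_model.topk_extensions(x, y, k=k1):
                y_new = y + [token]
                # 2) Rescore with the noisy channel objective:
                #    log p(x|y) + log p(y) (the paper also interpolates q(y|x)).
                score = (channel_model.score(x, y_new)
                         + language_model.score(y_new))
                candidates.append((y_new, score))
        # 3) Keep the K2 best candidates for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for y_new, score in candidates[:k2]:
            (finished if y_new[-1] == eos else beam).append((y_new, score))
        if not beam:
            break
    best = max(finished or beam, key=lambda c: c[1])
    return best[0]
```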
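
The Experiment Setup row's hyperparameters are restated below as a single configuration dictionary for readability. The key names are illustrative assumptions; only the values come from the quoted setup.

```python
# Hyperparameters quoted in the Experiment Setup row. Key names are
# illustrative; the paper reports only the values.
config = {
    "seq2seq": {
        "optimizer": "Adam",
        "learning_rate": 1e-3,
        "encoder_layers": 1,
        "decoder_layers": 1,
        "hidden_units": 256,
        "batch_size": 32,
        "dropout": 0.2,        # applied to LSTM input and output
    },
    "language_model": {
        "layers": 2,
        "hidden_units": 1024,
        "dropout": 0.5,
        "learning_rate": 1e-4,
    },
    "decoding": {
        "k1": 20,   # proposals generated by the direct model
        "k2": 10,   # best candidates kept by the noisy channel model
    },
}
```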