Minimum Divergence vs. Maximum Margin: an Empirical Comparison on Seq2Seq Models
Authors: Huan Zhang, Hai Zhao
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that our new training criterion can usually work better than existing methods, on both the tasks of machine translation and sentence summarization. |
| Researcher Affiliation | Academia | Huan Zhang, Hai Zhao; Department of Computer Science and Engineering, Shanghai Jiao Tong University; zhanghuan0468@gmail.com, zhaohai@cs.sjtu.edu.cn |
| Pseudocode | Yes | Algorithm 1 The sampling approach to constructing the approximated n-best list |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the methodology it describes. |
| Open Datasets | Yes | We use the IWSLT 2014 German-English translation dataset, with the same splits as Ranzato et al. (2016) and Wiseman & Rush (2016)... We use the Gigaword corpus with the same preprocessing steps as in Rush et al. (2015). |
| Dataset Splits | Yes | We use the IWSLT 2014 German-English translation dataset, with the same splits as Ranzato et al. (2016) and Wiseman & Rush (2016), which contains about 153K training sentence pairs, 7K validation sentence pairs and 7K test sentence pairs... During training, we use the first 2K sequences of the dev corpus as validation set |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments (e.g., GPU models, CPU types, or cloud instance specifications). |
| Software Dependencies | No | The paper mentions model components and training tools such as LSTM, RNN, GRU, and the Adam optimizer, but does not identify specific software libraries or version numbers for these or any other dependencies. |
| Experiment Setup | Yes | The encoder is a single-layer bidirectional LSTM with 256 hidden units for either direction, and the decoder LSTM also has 256 hidden units. The size of word embedding for both encoder and decoder is 256. We use a dropout rate... of 0.2... The batch size is set to 32 and the training set is shuffled at each new epoch. All models are trained with the Adam optimizer... The MLE baseline is trained with a learning rate of 3.0 × 10⁻⁴. The model is trained for 20 epochs... α for MRT... is set to 5.0 × 10⁻³ and the sample size is 100. α and τ for Hellinger loss... α = 5.0 × 10⁻⁴, τ = 0.5. |
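
For reference, the reported hyperparameters map onto a standard attentional seq2seq training setup. Below is a minimal sketch of the MLE baseline configuration, assuming PyTorch; the vocabulary sizes, the omitted attention mechanism, and all module and variable names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of the reported training configuration, assuming PyTorch.
# Only the sizes, dropout, batch size, optimizer, learning rate, and epoch count
# come from the quoted setup; everything else (vocab sizes, names, the missing
# attention/bridge between encoder and decoder) is an illustrative assumption.
import torch
import torch.nn as nn

EMB_SIZE = 256        # word embedding size for encoder and decoder
HIDDEN_SIZE = 256     # LSTM hidden units (per direction for the encoder)
DROPOUT = 0.2
BATCH_SIZE = 32
LEARNING_RATE = 3e-4  # MLE baseline learning rate (3.0 x 10^-4)
NUM_EPOCHS = 20

class Encoder(nn.Module):
    """Single-layer bidirectional LSTM encoder, 256 hidden units per direction."""
    def __init__(self, src_vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(src_vocab_size, EMB_SIZE)
        self.dropout = nn.Dropout(DROPOUT)
        self.lstm = nn.LSTM(EMB_SIZE, HIDDEN_SIZE, num_layers=1,
                            bidirectional=True, batch_first=True)

    def forward(self, src_tokens):
        x = self.dropout(self.embed(src_tokens))
        # outputs: (batch, src_len, 2 * HIDDEN_SIZE); also returns final (h, c)
        return self.lstm(x)

class Decoder(nn.Module):
    """Single-layer LSTM decoder with 256 hidden units (attention omitted for brevity)."""
    def __init__(self, tgt_vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab_size, EMB_SIZE)
        self.dropout = nn.Dropout(DROPOUT)
        self.lstm = nn.LSTM(EMB_SIZE, HIDDEN_SIZE, num_layers=1, batch_first=True)
        self.out = nn.Linear(HIDDEN_SIZE, tgt_vocab_size)

    def forward(self, tgt_tokens, state=None):
        x = self.dropout(self.embed(tgt_tokens))
        outputs, state = self.lstm(x, state)
        return self.out(outputs), state   # logits over the target vocabulary

# Hypothetical vocabulary sizes; the quoted setup does not report them.
encoder = Encoder(src_vocab_size=32000)
decoder = Decoder(tgt_vocab_size=32000)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=LEARNING_RATE)
```

The decoder is deliberately left unconnected from the bidirectional encoder state here, since the quoted setup does not specify the attention or state-bridging details; only the layer sizes, dropout, batch size, optimizer, learning rate, and epoch count above come from the reported configuration.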