Kernelized Bayesian Softmax for Text Generation
Authors: Ning Miao, Hao Zhou, Chengqi Zhao, Wenxian Shi, Lei Li
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on a variety of text generation tasks including machine translation, language modeling, and dialog generation. The empirical results verify the effectiveness of KerBS. Ablation study indicates that each part of KerBS, including the Bayesian composition and the kernel function, is necessary for the performance improvement. |
| Researcher Affiliation | Industry | Ning Miao, Hao Zhou, Chengqi Zhao, Wenxian Shi, Lei Li (ByteDance AI Lab) {miaoning,zhouhao.nlp,zhaochengqi.d,shiwenxian,lileilab}@bytedance.com |
| Pseudocode | Yes | Algorithm 1: Training scheme for KerBS (a hedged sketch of the KerBS scoring layer appears after the table). |
| Open Source Code | No | The paper does not include any statement or link providing access to the source code for the described methodology. |
| Open Datasets | Yes | We employ the Daily Dialog dataset from Li et al. [2017] for experiment, by deleting the overlapping of train and test sets in advance. |
| Dataset Splits | Yes | Following previous work, we use a 300k, 10k and 30k subset of One-Billion-Word Corpus for training, validating and testing, respectively. (A split sketch follows the table.) |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions tools and algorithms like (...) but does not list the software libraries or version numbers needed to reproduce the experiments. |
| Experiment Setup | Yes | For Seq2Seq, (hidden size, embedding dimension) are set to (512, 256) and (1024, 512), respectively. For Transformer, (hidden size, embedding dim, dropout, layer num, head num) is set to (288, 507, 0.1, 5, 2) for both MT and Dialog, following Lee et al. [2018]. All models are trained on sentences with up to 80 words. We set the batch size to 128 and the beam size to 5 for decoding. (...) For LM, we set the initial learning rate to 1.0, and the decay rate to 0.8. For MT and Dialog, the initial learning rate is 5e-4 and the decay rate is 0.5. (These values are collected in a config sketch after the table.) |
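
To make the "Pseudocode" row concrete: below is a minimal, hypothetical PyTorch sketch of a KerBS-style output layer. It is not the authors' implementation (the table notes no code was released). It keeps the two ingredients the ablation quote names, the Bayesian composition (several sense embeddings per word, combined by log-sum-exp) and a learnable kernel (simplified here to a per-word temperature on cosine similarity; the paper's exact kernel form differs), and it omits Algorithm 1's dynamic sense reallocation during training. All names (`KerBSHead`, `senses_per_word`, `theta`) are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KerBSHead(nn.Module):
    """Hypothetical sketch of a KerBS-style output layer (not the authors' code).

    Each word owns `senses_per_word` embeddings; a word's logit is a
    log-sum-exp over kernel responses between the decoder state and each
    of its senses (the "Bayesian composition"). The kernel is simplified
    to a per-word learnable temperature on cosine similarity.
    """

    def __init__(self, vocab_size: int, hidden_dim: int, senses_per_word: int = 3):
        super().__init__()
        self.senses_per_word = senses_per_word
        # Senses of word v occupy rows v*S .. v*S + S - 1.
        self.sense_emb = nn.Parameter(
            torch.randn(vocab_size * senses_per_word, hidden_dim) * 0.02)
        # One learnable kernel temperature per word.
        self.theta = nn.Parameter(torch.ones(vocab_size))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (batch, hidden) decoder states -> (batch, vocab) log-probs."""
        sim = F.normalize(h, dim=-1) @ F.normalize(self.sense_emb, dim=-1).t()
        sim = sim.view(h.size(0), -1, self.senses_per_word)   # (B, V, S)
        theta = F.softplus(self.theta).view(1, -1, 1)         # keep temperatures > 0
        word_scores = torch.logsumexp(theta * sim, dim=-1)    # Bayesian composition
        return F.log_softmax(word_scores, dim=-1)

head = KerBSHead(vocab_size=10_000, hidden_dim=512)
log_probs = head(torch.randn(4, 512))  # shape: (4, 10000)
```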
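The "Dataset Splits" row gives the subset sizes (300k/10k/30k) but not how they were drawn from the One-Billion-Word Corpus. A minimal sketch, assuming disjoint random sampling (the sampling procedure is an assumption, not stated in the paper):

```python
import random

def subset_split(sentences, sizes=(300_000, 10_000, 30_000), seed=0):
    """Draw disjoint train/valid/test subsets of the stated sizes."""
    rng = random.Random(seed)
    picks = rng.sample(range(len(sentences)), sum(sizes))
    train = [sentences[i] for i in picks[:sizes[0]]]
    valid = [sentences[i] for i in picks[sizes[0]:sizes[0] + sizes[1]]]
    test  = [sentences[i] for i in picks[sizes[0] + sizes[1]:]]
    return train, valid, test
```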
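The hyperparameters quoted in the "Experiment Setup" row can be collected into one configuration sketch. The dictionary names below are invented for illustration; the values are verbatim from the quote. The paper does not say whether the learning-rate decay fires per epoch or on a validation plateau, so the scheduler shown is a plain multiplicative decay used as a placeholder.

```python
import torch

# Hypothetical config names; values are verbatim from the quoted setup.
SEQ2SEQ = [dict(hidden_size=512, embedding_dim=256),
           dict(hidden_size=1024, embedding_dim=512)]
TRANSFORMER = dict(hidden_size=288, embedding_dim=507, dropout=0.1,
                   num_layers=5, num_heads=2)  # MT and Dialog, after Lee et al. [2018]
TRAINING = dict(max_sentence_len=80, batch_size=128, beam_size=5)
LR = dict(lm=dict(init=1.0, decay=0.8),
          mt_dialog=dict(init=5e-4, decay=0.5))

# Placeholder schedule: multiplies the LR by the decay factor on each
# step() call; when the decay is triggered is not stated in the paper.
model = torch.nn.Linear(8, 8)  # stand-in model
opt = torch.optim.Adam(model.parameters(), lr=LR["mt_dialog"]["init"])
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=LR["mt_dialog"]["decay"])
```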