Beyond MLE: Convex Learning for Text Generation
Authors: Chenze Shao, Zhengrui Ma, Min Zhang, Yang Feng
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various text generation tasks and models show the effectiveness of our approach. |
| Researcher Affiliation | Academia | 1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 School of Future Science and Engineering, Soochow University |
| Pseudocode | No | The paper includes mathematical formulations and proofs but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code is available at https://github.com/ictnlp/Convex-Learning. |
| Open Datasets | Yes | Datasets: We conduct experiments on the widely used translation benchmark WMT14 English-German (EN-DE, 4.5M)... We conduct experiments on two widely used summarization benchmarks: CNN/Daily Mail [18] and XSum [34]. |
| Dataset Splits | Yes | Datasets: We conduct experiments on the widely used translation benchmark WMT14 English-German (EN-DE, 4.5M), where the validation and test sets are newstest2013 and newstest2014, respectively. |
| Hardware Specification | Yes | The decoding speedup is measured with a batch size of 1 on GeForce RTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions software such as the Adam optimizer, BPE, the GPT-2 tokenizer, and BertTokenizer, but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | Detailed information regarding other training hyperparameters can be found in Table 7: settings of training hyperparameters on the WMT14 EN-DE dataset, reconstructed below. |

Table 7: Settings of training hyperparameters on the WMT14 EN-DE dataset (MLE vs. Convex columns per model).

| Hyperparameter | Transformer (MLE) | Transformer (Convex) | Vanilla-NAT (MLE) | Vanilla-NAT (Convex) | CMLM (MLE) | CMLM (Convex) | CTC (MLE) | CTC (Convex) |
|---|---|---|---|---|---|---|---|---|
| batch size | 32k | 32k | 64k | 256k | 64k | 256k | 64k | 256k |
| learning rate | 7e-4 | 2e-4 | 5e-4 | 3e-4 | 5e-4 | 3e-4 | 5e-4 | 3e-4 |
| warmup steps | 4k | 1k | 10k | 500 | 10k | 500 | 10k | 500 |
| training steps | 200k | 50k | 300k | 10k | 300k | 10k | 300k | 10k |
| dropout | 0.1 | 0.1 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.1 |
| weight decay | 0 | 0 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| label smoothing | 0.1 | 0.1 | 0.1 | 0 | 0.1 | 0 | 0.01 | 0 |
| length loss factor | - | - | 0.1 | 0.01 | 0.1 | 0.01 | - | - |
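As a rough illustration of how the Transformer column of Table 7 could be wired into a training loop, the sketch below builds an Adam optimizer (the paper names Adam) with an inverse-square-root warmup schedule; the schedule is the standard Transformer choice but an assumption here, the identifiers `TRANSFORMER_HPARAMS` and `make_optimizer` are hypothetical, and only the numeric values come from Table 7.

```python
# Minimal sketch, not the authors' code: maps the Transformer column of
# Table 7 onto a PyTorch Adam optimizer with an (assumed) inverse-sqrt
# warmup schedule. Only the numbers are taken from the paper.
import torch

TRANSFORMER_HPARAMS = {  # hypothetical container for the Table 7 values
    "mle":    {"lr": 7e-4, "warmup_steps": 4_000, "train_steps": 200_000,
               "dropout": 0.1, "weight_decay": 0.0, "label_smoothing": 0.1},
    "convex": {"lr": 2e-4, "warmup_steps": 1_000, "train_steps": 50_000,
               "dropout": 0.1, "weight_decay": 0.0, "label_smoothing": 0.1},
}

def make_optimizer(model: torch.nn.Module, setting: str = "mle"):
    """Return (optimizer, scheduler) for one Table 7 configuration."""
    hp = TRANSFORMER_HPARAMS[setting]
    opt = torch.optim.Adam(model.parameters(), lr=hp["lr"],
                           weight_decay=hp["weight_decay"])
    warmup = hp["warmup_steps"]

    def inv_sqrt(step: int) -> float:
        # Linear warmup to the base lr, then inverse-square-root decay.
        step = max(step, 1)
        return min(step / warmup, (warmup / step) ** 0.5)

    return opt, torch.optim.lr_scheduler.LambdaLR(opt, inv_sqrt)
```

Calling the scheduler's `step()` once per update for `train_steps` updates reproduces the warmup/decay shape; how the 32k-token batches were assembled (e.g., via gradient accumulation) is not specified here.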