Beyond MLE: Convex Learning for Text Generation

Authors: Chenze Shao, Zhengrui Ma, Min Zhang, Yang Feng

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on various text generation tasks and models show the effectiveness of our approach.
Researcher Affiliation | Academia | 1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 School of Future Science and Engineering, Soochow University
Pseudocode | No | The paper includes mathematical formulations and proofs but no structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code is available at https://github.com/ictnlp/Convex-Learning.
Open Datasets | Yes | We conduct experiments on the widely used translation benchmark WMT14 English-German (EN-DE, 4.5M)... We conduct experiments on two widely used summarization benchmarks: CNN/Daily Mail [18] and XSum [34].
Dataset Splits | Yes | We conduct experiments on the widely used translation benchmark WMT14 English-German (EN-DE, 4.5M), where the validation and test sets are newstest2013 and newstest2014, respectively.
Hardware Specification | Yes | The decoding speedup is measured with a batch size of 1 on GeForce RTX 3090 GPUs.
Software Dependencies | No | The paper mentions software such as the Adam optimizer, BPE, the GPT-2 tokenizer, and BertTokenizer, but does not provide specific version numbers for these or other key software components.
Experiment Setup | Yes | Detailed information regarding other training hyperparameters can be found in Table 7. Table 7: Settings of training hyperparameters on the WMT14 EN-DE dataset (values given as MLE / Convex for each model):
Transformer: batch size 32k / 32k, learning rate 7e-4 / 2e-4, warmup steps 4k / 1k, training steps 200k / 50k, dropout 0.1 / 0.1, weight decay 0 / 0, label smoothing 0.1 / 0.1
Vanilla-NAT: batch size 64k / 256k, learning rate 5e-4 / 3e-4, warmup steps 10k / 500, training steps 300k / 10k, dropout 0.3 / 0.3, weight decay 0.01 / 0.01, label smoothing 0.1 / 0, length loss factor 0.1 / 0.01
CMLM: batch size 64k / 256k, learning rate 5e-4 / 3e-4, warmup steps 10k / 500, training steps 300k / 10k, dropout 0.3 / 0.3, weight decay 0.01 / 0.01, label smoothing 0.1 / 0, length loss factor 0.1 / 0.01
CTC: batch size 64k / 256k, learning rate 5e-4 / 3e-4, warmup steps 10k / 500, training steps 300k / 10k, dropout 0.3 / 0.1, weight decay 0.01 / 0.01, label smoothing 0.01 / 0
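To make the Transformer hyperparameters above concrete, here is a minimal sketch assuming a PyTorch-style training setup. Only the numbers in the config come from Table 7 (Transformer column, Convex setting, with the MLE values noted in comments); the stand-in model, the inverse-square-root schedule, and the cross-entropy criterion are illustrative assumptions, and the convex loss itself comes from the authors' released code rather than this sketch.

```python
# Sketch only: maps the Table 7 Transformer (Convex) hyperparameters onto a
# generic PyTorch setup. Model, schedule shape, and criterion are assumptions.
import torch

config = {
    "batch_size_tokens": 32_000,  # 32k tokens per batch (same for MLE and Convex)
    "learning_rate": 2e-4,        # MLE uses 7e-4
    "warmup_steps": 1_000,        # MLE uses 4k
    "training_steps": 50_000,     # MLE uses 200k
    "dropout": 0.1,
    "weight_decay": 0.0,
    "label_smoothing": 0.1,
}

def lr_at_step(step: int) -> float:
    """Inverse-square-root schedule with linear warmup (a common Transformer
    choice; the excerpt does not state which schedule the authors used)."""
    peak, warmup = config["learning_rate"], config["warmup_steps"]
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * (warmup ** 0.5) / (step ** 0.5)

# Stand-in module so the snippet runs; the real model is a Transformer.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)

# MLE-style baseline loss with the Table 7 label smoothing; the convex loss is
# provided by the released code at https://github.com/ictnlp/Convex-Learning.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=config["label_smoothing"])
```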