Language Generation with Strictly Proper Scoring Rules

Authors: Chenze Shao, Fandong Meng, Yijin Liu, Jie Zhou

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results indicate that simply substituting the loss function, without adjusting other hyperparameters, can yield substantial improvements in the model's generation capabilities.
Researcher Affiliation Industry Chenze Shao 1 Fandong Meng 1 Yijin Liu 1 Jie Zhou 1 1Pattern Recognition Center, WeChat AI, Tencent Inc. Correspondence to: Chenze Shao <chenzeshao@tencent.com>, Fandong Meng <fandongmeng@tencent.com>, Yijin Liu <yijinliu@tencent.com>, Jie Zhou <withtomzhou@tencent.com>.
Pseudocode No The paper describes the proposed methods using prose and mathematical equations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor any structured step-by-step procedures formatted like code.
Open Source Code Yes Source code: https://github.com/shaochenze/ScoringRulesLM.
Open Datasets Yes For machine translation, we conduct experiments on widely used translation benchmarks under different scales: WMT14 English-French (En-Fr, 35.8M pairs), WMT14 English-German (En-De, 4.5M pairs), TED bilingual dataset (10 directions, each with 200K pairs)...For abstractive summarization, we conduct experiments on the summarization benchmark CNN/Daily Mail (311K pairs, Hermann et al., 2015)...We conduct instruction tuning using the Alpaca dataset by GPT-4 (Wang et al., 2022; Taori et al., 2023).
Dataset Splits Yes For WMT datasets, we use newstest2013 for validation and newstest2014 for test
Hardware Specification No The paper does not explicitly describe the specific hardware used for its experiments, such as GPU models (e.g., NVIDIA A100), CPU types, or cloud instance specifications. It only mentions the use of 'large language models (LLMs) such as LLaMA-7B and LLaMA-13B'.
Software Dependencies No The paper does not provide specific version numbers for software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other frameworks used for implementation or experimentation.
Experiment Setup Yes Table 2. Implementation details on different datasets.
Dataset            En-De   En-Fr   TED    CNN
batch size         32k     32k     32k    64k
learning rate      7e-4    5e-4    7e-4   2e-4
dropout            0.1     0.1     0.3    0.1
attention dropout  0       0       0      0.1
warmup steps       4k      4k      4k     2k
training steps     200k    300k    18k    100k
fine-tuning steps  50k     50k     4k     20k
weight decay       0       0       0.0    0.01
beam size          5       5       5      4
length penalty     0       0.6     1      2
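For anyone reproducing these runs, Table 2's per-dataset hyperparameters can be encoded as plain Python dictionaries and looked up programmatically. This is a minimal sketch: the key names (`batch_tokens`, `finetune_steps`, etc.) and the `get_config` helper are illustrative assumptions, not taken from the authors' released code.

```python
# Table 2 hyperparameters, one entry per dataset. Values are copied
# verbatim from the table; "32k" batch size is interpreted here as
# 32,000 tokens per batch (an assumption about the unit).
CONFIGS = {
    "En-De": dict(batch_tokens=32_000, lr=7e-4, dropout=0.1, attn_dropout=0.0,
                  warmup_steps=4_000, train_steps=200_000, finetune_steps=50_000,
                  weight_decay=0.0, beam_size=5, length_penalty=0.0),
    "En-Fr": dict(batch_tokens=32_000, lr=5e-4, dropout=0.1, attn_dropout=0.0,
                  warmup_steps=4_000, train_steps=300_000, finetune_steps=50_000,
                  weight_decay=0.0, beam_size=5, length_penalty=0.6),
    "TED":   dict(batch_tokens=32_000, lr=7e-4, dropout=0.3, attn_dropout=0.0,
                  warmup_steps=4_000, train_steps=18_000, finetune_steps=4_000,
                  weight_decay=0.0, beam_size=5, length_penalty=1.0),
    "CNN":   dict(batch_tokens=64_000, lr=2e-4, dropout=0.1, attn_dropout=0.1,
                  warmup_steps=2_000, train_steps=100_000, finetune_steps=20_000,
                  weight_decay=0.01, beam_size=4, length_penalty=2.0),
}

def get_config(dataset: str) -> dict:
    """Return the Table 2 hyperparameter dict for a dataset key, e.g. 'TED'."""
    return CONFIGS[dataset]
```

Keeping the settings in one table-shaped structure like this makes it easy to diff configurations across datasets and to feed them into whatever training framework is used.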