Language Generation with Strictly Proper Scoring Rules
Authors: Chenze Shao, Fandong Meng, Yijin Liu, Jie Zhou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that simply substituting the loss function, without adjusting other hyperparameters, can yield substantial improvements in the model's generation capabilities. (See the loss-substitution sketch below the table.) |
| Researcher Affiliation | Industry | Chenze Shao 1, Fandong Meng 1, Yijin Liu 1, Jie Zhou 1. 1 Pattern Recognition Center, WeChat AI, Tencent Inc. Correspondence to: Chenze Shao <chenzeshao@tencent.com>, Fandong Meng <fandongmeng@tencent.com>, Yijin Liu <yijinliu@tencent.com>, Jie Zhou <withtomzhou@tencent.com>. |
| Pseudocode | No | The paper describes the proposed methods using prose and mathematical equations, but it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor any structured step-by-step procedures formatted like code. |
| Open Source Code | Yes | Source code: https://github.com/shaochenze/ScoringRulesLM. |
| Open Datasets | Yes | For machine translation, we conduct experiments on widely used translation benchmarks at different scales: WMT14 English-French (En-Fr, 35.8M pairs), WMT14 English-German (En-De, 4.5M pairs), TED bilingual dataset (10 directions, each with 200K pairs)...For abstractive summarization, we conduct experiments on the summarization benchmark CNN/Daily Mail (311K pairs, Hermann et al., 2015)...We conduct instruction tuning using the Alpaca dataset by GPT-4 (Wang et al., 2022; Taori et al., 2023). |
| Dataset Splits | Yes | For WMT datasets, we use newstest2013 for validation and newstest2014 for testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for its experiments, such as GPU models (e.g., NVIDIA A100), CPU types, or cloud instance specifications. It only mentions the use of 'large language models (LLMs) such as LLaMA-7B and LLaMA-13B'. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other frameworks used for implementation or experimentation. |
| Experiment Setup | Yes | Table 2. Implementation details on different datasets (values given for En-De / En-Fr / TED / CNN): batch size 32k / 32k / 32k / 64k; learning rate 7e-4 / 5e-4 / 7e-4 / 2e-4; dropout 0.1 / 0.1 / 0.3 / 0.1; attention dropout 0 / 0 / 0 / 0.1; warmup steps 4k / 4k / 4k / 2k; training steps 200k / 300k / 18k / 100k; fine-tuning steps 50k / 50k / 4k / 20k; weight decay 0 / 0 / 0.0 / 0.01; beam size 5 / 5 / 5 / 4; length penalty 0 / 0.6 / 1 / 2. |
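
The paper's core proposal is to swap the usual logarithmic score (maximum-likelihood training) for other strictly proper scoring rules. As a rough illustration of what "substituting the loss function" in the Research Type row can look like, the sketch below implements token-level losses based on the Brier (quadratic) and spherical scores, two standard strictly proper scoring rules, in PyTorch. The function names, masking convention, and mean reduction are our assumptions; consult the released repository above for the authors' exact formulation.

```python
import torch
import torch.nn.functional as F


def brier_score_loss(logits, targets, ignore_index=-100):
    """Token-level loss from the Brier (quadratic) score S(p, y) = 2*p_y - ||p||^2.

    The score is strictly proper, so we minimize its negation.
    logits: (batch, seq, vocab); targets: (batch, seq).
    """
    probs = F.softmax(logits, dim=-1)
    # Clamp padding indices to 0 before gather; they are masked out below.
    p_y = probs.gather(-1, targets.clamp_min(0).unsqueeze(-1)).squeeze(-1)
    score = 2.0 * p_y - probs.pow(2).sum(dim=-1)
    mask = targets.ne(ignore_index).float()
    return -(score * mask).sum() / mask.sum().clamp_min(1.0)


def spherical_score_loss(logits, targets, ignore_index=-100):
    """Token-level loss from the spherical score S(p, y) = p_y / ||p||_2."""
    probs = F.softmax(logits, dim=-1)
    p_y = probs.gather(-1, targets.clamp_min(0).unsqueeze(-1)).squeeze(-1)
    score = p_y / probs.norm(p=2, dim=-1).clamp_min(1e-12)
    mask = targets.ne(ignore_index).float()
    return -(score * mask).sum() / mask.sum().clamp_min(1.0)


if __name__ == "__main__":
    # Smoke test on random logits: both losses are drop-in replacements
    # for F.cross_entropy over (batch, seq, vocab) outputs.
    logits = torch.randn(2, 5, 100)
    targets = torch.randint(0, 100, (2, 5))
    print(brier_score_loss(logits, targets).item())
    print(spherical_score_loss(logits, targets).item())
```

Both functions take the same arguments as a standard cross-entropy loss over decoder outputs, which is consistent with the paper's observation that the objective can be substituted without touching other hyperparameters.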
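
For scripted reproduction, the Table 2 settings in the Experiment Setup row can also be captured in a small Python mapping. The key names below are our own shorthand and are not tied to any particular training framework's configuration schema.

```python
# Per-dataset training settings transcribed from Table 2 of the paper.
# Key names are ours, not the authors'; batch size is in tokens.
TABLE2_CONFIGS = {
    "En-De": dict(batch_tokens=32_000, lr=7e-4, dropout=0.1, attention_dropout=0.0,
                  warmup_steps=4_000, training_steps=200_000, finetuning_steps=50_000,
                  weight_decay=0.0, beam_size=5, length_penalty=0.0),
    "En-Fr": dict(batch_tokens=32_000, lr=5e-4, dropout=0.1, attention_dropout=0.0,
                  warmup_steps=4_000, training_steps=300_000, finetuning_steps=50_000,
                  weight_decay=0.0, beam_size=5, length_penalty=0.6),
    "TED":   dict(batch_tokens=32_000, lr=7e-4, dropout=0.3, attention_dropout=0.0,
                  warmup_steps=4_000, training_steps=18_000, finetuning_steps=4_000,
                  weight_decay=0.0, beam_size=5, length_penalty=1.0),
    "CNN":   dict(batch_tokens=64_000, lr=2e-4, dropout=0.1, attention_dropout=0.1,
                  warmup_steps=2_000, training_steps=100_000, finetuning_steps=20_000,
                  weight_decay=0.01, beam_size=4, length_penalty=2.0),
}
```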