Tailoring Language Generation Models under Total Variation Distance

Authors: Haozhe Ji, Pei Ke, Zhipeng Hu, Rongsheng Zhang, Minlie Huang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method alleviates the overestimation of degenerated sequences without sacrificing diversity and improves generation quality on a wide range of text generation tasks.
Researcher Affiliation | Collaboration | Haozhe Ji¹, Pei Ke¹, Zhipeng Hu², Rongsheng Zhang², Minlie Huang¹. ¹Dept. of Comp. Sci. & Tech., State Key Lab of Intelligent Tech. & Sys., BNRist Center, Tsinghua University, Beijing 100084, China; ²Fuxi AI Lab, NetEase Inc., China.
Pseudocode | No | The paper includes a computational graph (Figure 2) but no explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Code is available at https://github.com/thu-coai/TaiLr.
Open Datasets | Yes | Specifically, we train a 1-layer LSTM on the texts of the COCO image caption dataset (Lin et al., 2014) without any conditional inputs. Statistics and sources of all datasets used in experiments are provided in Appendix D. To download and preprocess the IWSLT14 De-En dataset, we follow the instructions at https://github.com/facebookresearch/fairseq/tree/main/examples/translation. To download and preprocess the Gigaword corpus, we follow the instructions at https://huggingface.co/datasets/gigaword. To download the Writing Prompts dataset, we follow the instructions at https://github.com/facebookresearch/fairseq/tree/main/examples/stories. (A hedged dataset-loading sketch is given after the table.)
Dataset Splits | Yes | We sample 10K synthetic data for training and 5K for validation. The best checkpoint is selected based on the highest BLEU (Papineni et al., 2002) score on the development set. We select the best checkpoint based on the highest ROUGE-L (Lin, 2004) score on the development set. Statistics of the datasets used in Section 4.2 are given in Table 5 (which includes a 'dev' column).
Hardware Specification | No | The paper mentions running experiments on "GPUs" but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using the "fairseq toolkit" and the "Adam optimizer" but does not specify version numbers for these or any other software dependencies, making replication difficult.
Experiment Setup | Yes | We train two LSTMs...for 100 epochs...using the Adam optimizer (β1 = 0.9, β2 = 0.999) with a fixed learning rate of 1e-3 and no weight decay. We use a maximum of 4096 tokens per batch. The dropout rate is set to 0.1. We tune the hyperparameter of the proxy distribution γ ∈ {1e-8, 1e-7, ..., 0.1, 1.0}. Both models are trained for 5 epochs with an initial learning rate of 1e-3 using a linear scheduler. The batch size is set to 64, with gradient accumulation steps of 4. The models are trained with the Adam optimizer (β1 = 0.9, β2 = 0.98) using an inverse square root schedule with an initial learning rate of 3e-4 and a weight decay of 1e-4. We train the models for a total of 80 epochs with a maximum of 4096 tokens per batch and use 4000 warmup updates. We set the dropout rate to 0.3, and use label smoothing of 0.1 as standard practice. (A hedged optimizer/scheduler sketch follows the table.)
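
The Gigaword corpus referenced in the Open Datasets row is distributed through the Hugging Face Hub. The snippet below is a minimal loading sketch assuming the Hugging Face `datasets` library; the split names and the "document"/"summary" fields follow the public dataset card, and none of the paper's own preprocessing is reproduced here.

```python
# Minimal sketch: loading the Gigaword corpus via the Hugging Face `datasets`
# library, as pointed to in the Open Datasets row. Tokenization and any
# paper-specific preprocessing are not reproduced here (assumption).
from datasets import load_dataset

gigaword = load_dataset("gigaword")   # DatasetDict with train/validation/test splits
print(gigaword)                       # split sizes
print(gigaword["train"][0])           # each example has "document" and "summary" fields
```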
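
The Experiment Setup row quotes Adam with betas (0.9, 0.98), an inverse square root schedule with a 3e-4 initial learning rate, 4000 warmup updates, and a weight decay of 1e-4. The paper itself uses the fairseq toolkit, so the sketch below is only a hedged plain-PyTorch re-implementation of that configuration; the `model` argument and the padding index are placeholders, not values taken from the paper.

```python
# Hedged re-implementation of the quoted optimization setup in plain PyTorch.
# The paper relies on the fairseq toolkit, so this is a sketch of the described
# configuration, not the authors' code. `model` is a placeholder.
import torch


def build_optimizer_and_scheduler(model, peak_lr=3e-4, warmup_steps=4000,
                                  betas=(0.9, 0.98), weight_decay=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                                 betas=betas, weight_decay=weight_decay)

    def inverse_sqrt(step: int) -> float:
        # Linear warmup to the peak learning rate, then decay ~ 1/sqrt(step).
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps
        return (warmup_steps / step) ** 0.5

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
    return optimizer, scheduler


# Label smoothing of 0.1 as quoted above; ignoring a padding index of 0 is an
# assumption made only for this sketch.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)
```

Calling `scheduler.step()` once per optimizer update (rather than once per epoch) matches the per-update warmup counting implied by the quoted "4000 warmup updates".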