Tailoring Language Generation Models under Total Variation Distance

Authors: Haozhe Ji, Pei Ke, Zhipeng Hu, Rongsheng Zhang, Minlie Huang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method alleviates the overestimation of degenerated sequences without sacrificing diversity and improves generation quality on a wide range of text generation tasks.
Researcher Affiliation | Collaboration | Haozhe Ji¹, Pei Ke¹, Zhipeng Hu², Rongsheng Zhang², Minlie Huang¹. ¹Dept. of Comp. Sci. & Tech., State Key Lab of Intelligent Tech. & Sys., BNRist Center, Tsinghua University, Beijing 100084, China; ²Fuxi AI Lab, NetEase Inc., China.
Pseudocode | No | The paper includes a computational graph (Figure 2) but no explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Code is available at https://github.com/thu-coai/TaiLr.
Open Datasets | Yes | Specifically, we train a 1-layer LSTM on the texts of the COCO image caption dataset (Lin et al., 2014) without any conditional inputs. Statistics and sources of all datasets used in experiments are provided in Appendix D. To download and preprocess the IWSLT14 De-En dataset, we follow the instructions at https://github.com/facebookresearch/fairseq/tree/main/examples/translation. To download and preprocess the Gigaword corpus, we follow the instructions at https://huggingface.co/datasets/gigaword. To download the Writing Prompts dataset, we follow the instructions at https://github.com/facebookresearch/fairseq/tree/main/examples/stories. (A hedged dataset-loading sketch is given after the table.)
Dataset Splits | Yes | We sample 10K synthetic data for training and 5K for validation. The best checkpoint is selected based on the highest BLEU (Papineni et al., 2002) score on the development set. We select the best checkpoint based on the highest ROUGE-L (Lin, 2004) score on the development set. Statistics of the datasets used in Section 4.2 are given in Table 5 (which includes a 'dev' column).
Hardware Specification | No | The paper mentions running experiments on "GPUs" but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using the "fairseq toolkit" and the "Adam optimizer" but does not specify version numbers for these or any other software dependencies, making replication difficult.
Experiment Setup | Yes | We train two LSTMs...for 100 epochs...using the Adam optimizer (β1 = 0.9, β2 = 0.999) with a fixed learning rate of 1e-3 and no weight decay. We use a maximum of 4096 tokens per batch. The dropout rate is set to 0.1. We tune the hyperparameter of the proxy distribution γ ∈ {1e-8, 1e-7, ..., 0.1, 1.0}. Both models are trained for 5 epochs with an initial learning rate of 1e-3 using a linear scheduler. The batch size is set to 64, with gradient accumulation steps of 4. The models are trained with the Adam optimizer (β1 = 0.9, β2 = 0.98) using an inverse square root schedule with an initial learning rate of 3e-4 and a weight decay of 1e-4. We train the models for a total of 80 epochs with a maximum of 4096 tokens per batch and use 4000 warmup updates. We set the dropout rate to 0.3, and use label smoothing of 0.1 as standard practice. (A hedged optimizer/scheduler sketch follows the table.)
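
The Gigaword corpus referenced in the Open Datasets row is distributed through the Hugging Face Hub. The snippet below is a minimal loading sketch assuming the Hugging Face `datasets` library; the split names and the "document"/"summary" fields follow the public dataset card, and none of the paper's own preprocessing is reproduced here.

```python
# Minimal sketch: loading the Gigaword corpus via the Hugging Face `datasets`
# library, as pointed to in the Open Datasets row. Tokenization and any
# paper-specific preprocessing are not reproduced here (assumption).
from datasets import load_dataset

gigaword = load_dataset("gigaword")   # DatasetDict with train/validation/test splits
print(gigaword)                       # split sizes
print(gigaword["train"][0])           # each example has "document" and "summary" fields
```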
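
The Experiment Setup row quotes Adam with betas (0.9, 0.98), an inverse square root schedule with a 3e-4 initial learning rate, 4000 warmup updates, and a weight decay of 1e-4. The paper itself uses the fairseq toolkit, so the sketch below is only a hedged plain-PyTorch re-implementation of that configuration; the `model` argument and the padding index are placeholders, not values taken from the paper.

```python
# Hedged re-implementation of the quoted optimization setup in plain PyTorch.
# The paper relies on the fairseq toolkit, so this is a sketch of the described
# configuration, not the authors' code. `model` is a placeholder.
import torch


def build_optimizer_and_scheduler(model, peak_lr=3e-4, warmup_steps=4000,
                                  betas=(0.9, 0.98), weight_decay=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                                 betas=betas, weight_decay=weight_decay)

    def inverse_sqrt(step: int) -> float:
        # Linear warmup to the peak learning rate, then decay ~ 1/sqrt(step).
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps
        return (warmup_steps / step) ** 0.5

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
    return optimizer, scheduler


# Label smoothing of 0.1 as quoted above; ignoring a padding index of 0 is an
# assumption made only for this sketch.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)
```

Calling `scheduler.step()` once per optimizer update (rather than once per epoch) matches the per-update warmup counting implied by the quoted "4000 warmup updates".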