EMO: EARTH MOVER DISTANCE OPTIMIZATION FOR AUTO-REGRESSIVE LANGUAGE MODELING
Authors: Siyu Ren, Zhiyong Wu, Kenny Q. Zhu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Upon extensive evaluation, EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO shows noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences, highlighting its potential as a lightweight calibration method for enhancing large-scale pre-trained language models. |
| Researcher Affiliation | Collaboration | Siyu Ren (Shanghai Jiao Tong University, Shanghai AI Laboratory), Zhiyong Wu (Shanghai AI Laboratory), Kenny Q. Zhu (University of Texas at Arlington); roy0702@sjtu.edu.cn, wuzhiyong@pjlab.org.cn, kenny.zhu@uta.edu |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured steps formatted like code. |
| Open Source Code | Yes | Code available at https://github.com/DRSY/EMO. |
| Open Datasets | Yes | We use 6 English textual corpora from 5 different domains for comprehensive evaluation: (1) WikiText-2 and WikiText-103 (Merity et al., 2016) are two commonly used language modeling benchmarks consisting of high-quality Wikipedia articles. (2) WebText-test (Radford et al., 2018) is the test set of the official WebText dataset from OpenAI that was used to train GPT-2. (3) Penn Treebank (PTB) (Marcus et al., 1993) contains Wall Street Journal material in the financial domain. (4) WritingPrompts (Fan et al., 2018) features text from the writing prompts forum of Reddit. (5) AG News (Zhang et al., 2015) is a collection of news articles from diverse domains, e.g., business, sports, and science. |
| Dataset Splits | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. Table 4 lists the number of dev samples per dataset. |
| Hardware Specification | No | The paper mentions using specific pre-trained LLMs like LLaMA-7B/13B and GPT-2, but it does not provide details about the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or operating systems used in the experiments. |
| Experiment Setup | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. We use the AdamW (Loshchilov & Hutter, 2018) optimizer with a learning rate of 5e-5. The batch size is fixed at 32 for all experiments. The maximum input length during training is set to 256. For TaiLr and MixCE, which involve a weighting coefficient, we conduct a hyperparameter sweep within {0.9, 0.8, 0.7}. EMO does not necessitate any hyperparameter tuning. A configuration sketch of this setup is given below the table. |
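
The sketch below illustrates the reported fine-tuning configuration (3 epochs, AdamW with learning rate 5e-5, batch size 32, maximum input length 256, best checkpoint selected by validation loss) using the Hugging Face Transformers Trainer, with WikiText-2 standing in for the evaluated corpora. The Trainer-based setup, the dataset identifier, and the use of the standard MLE objective are assumptions made for illustration; the paper's EMO loss is not reimplemented here and is available in the linked repository.

```python
# Minimal sketch of the reported fine-tuning setup: GPT-2, AdamW, lr 5e-5,
# batch size 32, max length 256, 3 epochs, best checkpoint by validation loss.
# NOTE: this trains with the standard MLE loss, not the paper's EMO objective.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # or "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# WikiText-2 as one of the evaluated corpora (assumed Hugging Face dataset id).
raw = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(batch):
    # Truncate to the reported maximum input length of 256 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
# Drop empty lines so the collator never sees zero-length sequences.
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 1)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="emo-repro",
    num_train_epochs=3,              # 3 epochs, as reported
    learning_rate=5e-5,              # AdamW with lr 5e-5 (Trainer default optimizer)
    per_device_train_batch_size=32,  # batch size fixed at 32
    evaluation_strategy="epoch",     # called `eval_strategy` in newer versions
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the checkpoint with lowest val loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()
```

Selecting the checkpoint with the lowest validation loss is expressed here via `load_best_model_at_end` with `eval_loss` as the selection metric, which mirrors the checkpointing rule stated in the paper.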