EMO: EARTH MOVER DISTANCE OPTIMIZATION FOR AUTO-REGRESSIVE LANGUAGE MODELING

Authors: Siyu Ren, Zhiyong Wu, Kenny Q. Zhu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Upon extensive evaluation, EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO shows noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences, highlighting its potential as a lightweight calibration method for enhancing large-scale pre-trained language models.
Researcher Affiliation | Collaboration | Siyu Ren (1,2), Zhiyong Wu (2), Kenny Q. Zhu (3); 1: Shanghai Jiao Tong University, 2: Shanghai AI Laboratory, 3: University of Texas at Arlington; roy0702@sjtu.edu.cn, wuzhiyong@pjlab.org.cn, kenny.zhu@uta.edu
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured steps formatted like code.
Open Source Code | Yes | Code available at https://github.com/DRSY/EMO.
Open Datasets | Yes | We use 6 English textual corpora from 5 different domains for comprehensive evaluation: (1) WikiText-2 and WikiText-103 (Merity et al., 2016) are two commonly used language modeling benchmarks consisting of high-quality Wikipedia articles. (2) WebText-test (Radford et al., 2018) is the test set of the official WebText dataset from OpenAI that was used to train GPT-2. (3) Penn Treebank (PTB) (Marcus et al., 1993) contains Wall Street Journal material in the financial domain. (4) WritingPrompts (Fan et al., 2018) features text from the writing prompts forum of Reddit. (5) AG News (Zhang et al., 2015) is a collection of news articles from diverse domains, e.g., business, sports, and science. (A loading sketch for these corpora appears after the table.)
Dataset Splits | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. (Table 4 reports the number of dev samples.)
Hardware Specification | No | The paper mentions using specific pre-trained LLMs like LLaMA-7B/13B and GPT-2, but it does not provide details about the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or operating systems used in the experiments.
Experiment Setup | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. We use the AdamW (Loshchilov & Hutter, 2018) optimizer with a learning rate of 5e-5. The batch size is fixed as 32 for all experiments. The maximum input length during training is set to 256. For TaiLr and MixCE, which involve a weighting coefficient, we conduct a hyperparameter sweep within {0.9, 0.8, 0.7}. EMO does not necessitate any hyperparameter tuning. (A fine-tuning sketch with these settings appears after the table.)
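
To make the Open Datasets row concrete, here is a minimal, hypothetical loading sketch. The paper does not say how the corpora were obtained, so the Hugging Face `datasets` hub IDs below are assumptions rather than the authors' pipeline.

```python
# Hypothetical loading sketch for the evaluation corpora listed above (not taken from
# the paper). The Hugging Face hub IDs below are assumptions; Penn Treebank,
# WebText-test, and WritingPrompts are typically fetched from their original releases.
from datasets import load_dataset

wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")      # WikiText-2 (Merity et al., 2016)
wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1")  # WikiText-103
ag_news = load_dataset("ag_news")                              # AG News (Zhang et al., 2015)

# The WikiText corpora expose train/validation/test splits; e.g. inspect one validation line:
print(wikitext2["validation"][0]["text"][:200])
```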
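The Experiment Setup row translates directly into a training configuration. Below is a minimal sketch under the assumption that the Hugging Face Trainer is used; the hyperparameters (3 epochs, AdamW, learning rate 5e-5, batch size 32, maximum length 256, best checkpoint by validation loss) come from the paper, while the Trainer usage and the `tokenize` helper are illustrative. The authors' actual implementation is at https://github.com/DRSY/EMO.

```python
# Sketch of the reported fine-tuning setup; library choices are assumptions, not the
# authors' released code.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # or "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2/OPT have no dedicated pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    # Truncate to the maximum input length of 256 tokens reported in the paper.
    return tokenizer(batch["text"], truncation=True, max_length=256)

args = TrainingArguments(
    output_dir="emo-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=5e-5,                  # AdamW is the Trainer's default optimizer
    evaluation_strategy="epoch",         # called `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# With tokenized train/dev splits (e.g. from the loading sketch above):
# trainer = Trainer(model=model, args=args,
#                   train_dataset=tokenized_train, eval_dataset=tokenized_dev,
#                   data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
# trainer.train()
```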