EMO: EARTH MOVER DISTANCE OPTIMIZATION FOR AUTO-REGRESSIVE LANGUAGE MODELING

Authors: Siyu Ren, Zhiyong Wu, Kenny Q. Zhu

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Upon extensive evaluation, EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO shows noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences, highlighting its potential as a lightweight calibration method for enhancing large-scale pre-trained language models.
Researcher Affiliation | Collaboration | Siyu Ren (1,2), Zhiyong Wu (2), Kenny Q. Zhu (3); 1: Shanghai Jiao Tong University, 2: Shanghai AI Laboratory, 3: University of Texas at Arlington; roy0702@sjtu.edu.cn, wuzhiyong@pjlab.org.cn, kenny.zhu@uta.edu
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured steps formatted like code.
Open Source Code | Yes | Code available at https://github.com/DRSY/EMO.
Open Datasets | Yes | We use 6 English textual corpora from 5 different domains for comprehensive evaluation: (1) WikiText-2 and WikiText-103 (Merity et al., 2016) are two commonly used language modeling benchmarks consisting of high-quality Wikipedia articles. (2) WebText-test (Radford et al., 2018) is the test set of the official WebText dataset from OpenAI that was used to train GPT-2. (3) Penn Treebank (PTB) (Marcus et al., 1993) contains Wall Street Journal material in the financial domain. (4) WritingPrompts (Fan et al., 2018) features text from the writing prompts forum of Reddit. (5) AG News (Zhang et al., 2015) is a collection of news articles from diverse domains, e.g., business, sports, and science. (A loading sketch for these corpora appears after the table.)
Dataset Splits | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. (Table 4 reports the number of dev samples.)
Hardware Specification | No | The paper mentions using specific pre-trained LLMs like LLaMA-7B/13B and GPT-2, but it does not provide details about the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or operating systems used in the experiments.
Experiment Setup | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. We use the AdamW (Loshchilov & Hutter, 2018) optimizer with a learning rate of 5e-5. The batch size is fixed as 32 for all experiments. The maximum input length during training is set to 256. For TaiLr and MixCE, which involve a weighting coefficient, we conduct a hyperparameter sweep within {0.9, 0.8, 0.7}. EMO does not necessitate any hyperparameter tuning. (A fine-tuning sketch with these settings appears after the table.)
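
To make the Open Datasets row concrete, here is a minimal, hypothetical loading sketch. The paper does not say how the corpora were obtained, so the Hugging Face `datasets` hub IDs below are assumptions rather than the authors' pipeline.

```python
# Hypothetical loading sketch for the evaluation corpora listed above (not taken from
# the paper). The Hugging Face hub IDs below are assumptions; Penn Treebank,
# WebText-test, and WritingPrompts are typically fetched from their original releases.
from datasets import load_dataset

wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")      # WikiText-2 (Merity et al., 2016)
wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1")  # WikiText-103
ag_news = load_dataset("ag_news")                              # AG News (Zhang et al., 2015)

# The WikiText corpora expose train/validation/test splits; e.g. inspect one validation line:
print(wikitext2["validation"][0]["text"][:200])
```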
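The Experiment Setup row translates directly into a training configuration. Below is a minimal sketch under the assumption that the Hugging Face Trainer is used; the hyperparameters (3 epochs, AdamW, learning rate 5e-5, batch size 32, maximum length 256, best checkpoint by validation loss) come from the paper, while the Trainer usage and the `tokenize` helper are illustrative. The authors' actual implementation is at https://github.com/DRSY/EMO.

```python
# Sketch of the reported fine-tuning setup; library choices are assumptions, not the
# authors' released code.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # or "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2/OPT have no dedicated pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    # Truncate to the maximum input length of 256 tokens reported in the paper.
    return tokenizer(batch["text"], truncation=True, max_length=256)

args = TrainingArguments(
    output_dir="emo-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=5e-5,                  # AdamW is the Trainer's default optimizer
    evaluation_strategy="epoch",         # called `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# With tokenized train/dev splits (e.g. from the loading sketch above):
# trainer = Trainer(model=model, args=args,
#                   train_dataset=tokenized_train, eval_dataset=tokenized_dev,
#                   data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
# trainer.train()
```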