Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
EMO: EARTH MOVER DISTANCE OPTIMIZATION FOR AUTO-REGRESSIVE LANGUAGE MODELING
Authors: Siyu Ren, Zhiyong Wu, Kenny Q. Zhu
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Upon extensive evaluation, EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO shows noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences, highlighting its potential as a lightweight calibration method for enhancing large-scale pre-trained language models. |
| Researcher Affiliation | Collaboration | Siyu Ren (1,2), Zhiyong Wu (2), Kenny Q. Zhu (3); 1: Shanghai Jiao Tong University; 2: Shanghai AI Laboratory; 3: University of Texas at Arlington. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured steps formatted like code. |
| Open Source Code | Yes | Code available at https://github.com/DRSY/EMO. |
| Open Datasets | Yes | We use 6 English textual corpora from 5 different domains for comprehensive evaluation: (1) WikiText-2 and WikiText-103 (Merity et al., 2016) are two commonly used language modeling benchmarks consisting of high-quality Wikipedia articles. (2) The WebText test set (Radford et al., 2018) is the test set of the official WebText dataset from OpenAI that was used to train GPT-2. (3) Penn Treebank (PTB) (Marcus et al., 1993) contains Wall Street Journal material in the financial domain. (4) WritingPrompts (Fan et al., 2018) features text from the writing prompts forum of Reddit. (5) AG News (Zhang et al., 2015) is a collection of news articles from diverse domains, e.g., business, sports, and science. |
| Dataset Splits | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. Table 4 lists the number of dev samples. |
| Hardware Specification | No | The paper mentions using specific pre-trained LLMs like LLaMA-7B/13B and GPT-2, but it does not provide details about the specific hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries, frameworks, or operating systems used in the experiments. |
| Experiment Setup | Yes | We fine-tune GPT-2 and OPT-125M for 3 epochs on the training set of each dataset and save the model checkpoint with the lowest validation loss. We use the AdamW (Loshchilov & Hutter, 2018) optimizer with a learning rate of 5e-5. The batch size is fixed at 32 for all experiments. The maximum input length during training is set to 256. For TaiLr and MixCE, which involve a weighting coefficient, we conduct a hyperparameter sweep within {0.9, 0.8, 0.7}. EMO does not necessitate any hyperparameter tuning. |
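The paper's central quantity, per its title, is the earth mover distance (EMD) between discrete distributions. As a minimal illustration only (not the paper's actual training objective, which operates over token distributions with learned ground costs), the sketch below computes EMD for two probability vectors on a shared, ordered 1-D support with unit spacing, where EMD reduces to the L1 distance between the cumulative distribution functions; the function name `emd_1d` is hypothetical.

```python
def emd_1d(p, q):
    """Earth mover distance between two probability vectors p and q
    defined on the same ordered 1-D support with unit spacing.

    In this special case, EMD equals the L1 distance between the CDFs:
    sum_i |CDF_p(i) - CDF_q(i)|.
    """
    assert len(p) == len(q), "distributions must share the same support"
    cdf_diff = 0.0  # running difference of the two CDFs
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_diff += pi - qi
        total += abs(cdf_diff)
    return total
```

For example, moving all mass one position over costs exactly 1: `emd_1d([1.0, 0.0], [0.0, 1.0])` returns 1.0, while identical distributions give 0.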