Pre-training via Paraphrasing
Authors: Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, Luke Zettlemoyer
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization. The objective noisily captures aspects of paraphrase, translation, multi-document summarization, and information retrieval, allowing for strong zero-shot performance on several tasks. For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation. We further show that fine-tuning gives strong performance on a range of discriminative and generative tasks in many languages, making MARGE the most generally applicable pre-training method to date. We compare performance to published numbers for these models. (See the first code sketch after the table.) |
| Researcher Affiliation | Industry | Facebook AI mikelewis@fb.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Using a multilingual version of the CC-NEWS corpus [Liu et al., 2019], we train initially with 64 workers for 450k steps... and, to explore domain effects, we further pre-train for 100k steps on Wikipedia data... |
| Dataset Splits | No | The paper mentions 'development data' for tuning hyperparameters but does not provide specific details on overall train/validation/test dataset splits (percentages, sample counts, or defined methodologies) for its main pre-training data (CC-NEWS, Wikipedia). |
| Hardware Specification | No | The paper mentions 'GPU memory' and 'GPU Days (estimated)' in Table 1, and the use of 'workers' (64, then 2048) during pre-training, but does not provide specific hardware details such as exact GPU/CPU models or processor types used for running its experiments. |
| Software Dependencies | No | The paper mentions using a 'Transformer model' and 'fastText' but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | we train initially with 64 workers for 450k steps (linearly annealing the learning rate from 1e-04 to 0 with 10k warmup steps), and then continue training with 2048 workers for 550k steps (annealing the learning rate from 2e-04 to 0). (See the second code sketch after the table.) |
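
The Research Type row quotes the paper's claim that MARGE jointly learns retrieval and reconstruction from a random initialization. The following is a minimal, self-contained sketch of that idea, assuming a generic PyTorch seq2seq setup: a target document is reconstructed from retrieved evidence documents, and the retrieval (cosine relevance) scores weight the evidence so they receive gradient through the reconstruction loss. The class and parameter names (`MargeSketch`, `d_model`, the mean-pooled document embeddings, the softmax relevance weighting) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MargeSketch(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def doc_embedding(self, tokens):
        # Pooled encoder state used as a document's retrieval representation.
        return self.encoder(self.embed(tokens)).mean(dim=1)

    def forward(self, target, evidence):
        # target: (batch, tgt_len) token ids; evidence: (batch, n_docs, src_len) token ids.
        b, n_docs, src_len = evidence.shape
        ev_states = self.encoder(self.embed(evidence.reshape(b * n_docs, src_len)))
        ev_states = ev_states.reshape(b, n_docs, src_len, -1)

        # Relevance f(x, z): cosine similarity between target and evidence embeddings.
        tgt_emb = F.normalize(self.doc_embedding(target), dim=-1)
        ev_emb = F.normalize(ev_states.mean(dim=2), dim=-1)
        relevance = torch.softmax((tgt_emb.unsqueeze(1) * ev_emb).sum(-1), dim=1)

        # Weight evidence states by relevance so the retrieval scores get gradient
        # from the reconstruction loss (a simplification of the paper's
        # relevance-biased cross-attention).
        memory = (ev_states * relevance[:, :, None, None]).reshape(b, n_docs * src_len, -1)
        dec = self.decoder(self.embed(target[:, :-1]), memory)
        logits = self.lm_head(dec)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target[:, 1:].reshape(-1))


# Usage: reconstruct a 2-document batch of targets from 4 retrieved evidence docs each.
model = MargeSketch()
loss = model(torch.randint(0, 32000, (2, 16)), torch.randint(0, 32000, (2, 4, 12)))
loss.backward()
```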
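
The Experiment Setup row reports a concrete learning-rate schedule for the first pre-training stage: linear warmup over 10k steps to a peak of 1e-04, then linear annealing to 0 by step 450k. Below is a minimal sketch of that schedule; the use of Adam and PyTorch's `LambdaLR` is an assumption for illustration, since the paper does not give optimizer code. The second stage described in the same row would follow the same shape with a 2e-04 peak over 550k steps.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # peak lr 1e-04

total_steps, warmup_steps = 450_000, 10_000

def lr_scale(step):
    # Multiplier on the base lr: ramp up during warmup, then decay linearly to 0.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

for step in range(3):           # forward/backward/optimizer.step() would go here
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())  # lr after a few warmup steps
```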