Better Fine-Tuning by Reducing Representational Collapse
Authors: Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, Sonal Gupta
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including CNN/DailyMail, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We first measure performance by fine-tuning on a range of tasks and languages. The subsequent sections examine why trust-region-inspired methods, including ours, outperform standard fine-tuning. |
| Researcher Affiliation | Industry | Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta & Naman Goyal Facebook {armenag,akshats,anchit,naman}@fb.com Luke Zettlemoyer & Sonal Gupta Facebook {lsz, sonalgupta}@fb.com |
| Pseudocode | No | The paper describes its methods using mathematical equations and textual descriptions but does not include any explicit pseudocode or algorithm blocks (a hedged sketch of the described objective is given below the table). |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We first test R3F and R4F on sentence classification tasks from the GLUE benchmark (Wang et al., 2018). We select the same subset of GLUE tasks reported by prior work in this space (Jiang et al., 2019): MNLI (Williams et al., 2018), QQP (Iyer et al., 2017), RTE (Bentivogli et al., 2009), QNLI (Rajpurkar et al., 2016), MRPC (Dolan & Brockett, 2005), CoLA (Warstadt et al., 2018), and SST-2 (Socher et al., 2013). We also evaluate on the popular XNLI benchmark, containing 15 languages (Conneau et al., 2018). For abstractive summarization, due to its additional complexity and computational cost, we look at three datasets: CNN/DailyMail (Hermann et al., 2015), Gigaword (Napoles et al., 2012), and Reddit TIFU (Kim et al., 2018). |
| Dataset Splits | Yes | We report the performance of all models on the GLUE development set. We present our best results on the GLUE development set for various fine-tuning methods applied to the RoBERTa Large model. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU or CPU models, or details about the computing environment used for experiments. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For our GLUE-related experiments, both full fine-tuning and probing, the following parameters are used. Table 5: Task-specific hyperparameters for GLUE experiments (Learning Rate, Max Updates, Max Sentences). Table 6: Hyperparameters for R3F and R4F experiments on GLUE (Optimizer, LR Scheduler, Dropout, Weight Decay, Warmup Updates, λ, Noise Types, σ); see the sketches after this table for how these terms enter the objective. |
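
Since the paper presents its method only through equations (see the Pseudocode row above), the following is a minimal PyTorch sketch of the R3F-style objective it describes: a task loss on clean input embeddings plus a symmetric KL penalty between the model's output distributions on clean and noise-perturbed embeddings. The hooks `model.embed` and `model.forward_from_embeddings`, and the default values of `lam` and `sigma`, are illustrative assumptions; λ, σ, and the noise type correspond to the hyperparameters listed in the Experiment Setup row, not to values confirmed here.

```python
import torch
import torch.nn.functional as F

def r3f_loss(model, input_ids, labels, lam=1.0, sigma=1e-5, noise="normal"):
    """One training step of an R3F-style objective (a sketch, not the authors' code).

    A task loss is computed on clean input embeddings; a second forward pass on
    noise-perturbed embeddings gives an output distribution whose symmetric KL
    divergence from the clean distribution is penalized with weight `lam`.
    `model.embed` and `model.forward_from_embeddings` are hypothetical hooks for
    a model that exposes its embedding layer.
    """
    embeds = model.embed(input_ids)                       # (batch, seq, dim)
    logits_clean = model.forward_from_embeddings(embeds)  # (batch, num_classes)

    # Sample parametric noise in embedding space: N(0, sigma^2 I) or U(-sigma, sigma).
    if noise == "normal":
        z = torch.randn_like(embeds) * sigma
    else:
        z = torch.empty_like(embeds).uniform_(-sigma, sigma)
    logits_noisy = model.forward_from_embeddings(embeds + z)

    task_loss = F.cross_entropy(logits_clean, labels)

    p = F.log_softmax(logits_clean, dim=-1)
    q = F.log_softmax(logits_noisy, dim=-1)
    # Symmetric KL between the clean and perturbed output distributions.
    sym_kl = (F.kl_div(q, p, log_target=True, reduction="batchmean")
              + F.kl_div(p, q, log_target=True, reduction="batchmean"))

    return task_loss + lam * sym_kl
```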
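
The R4F variant named in the same hyperparameter tables additionally constrains the classification head to be Lipschitz; the paper realizes this with spectral normalization, which could be applied as in the sketch below. The 1024-dimensional input (RoBERTa Large's hidden size) and the `num_classes` value are illustrative assumptions.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# R4F = R3F plus a Lipschitz constraint on the classification head,
# implemented here with PyTorch's built-in spectral normalization.
num_classes = 3  # e.g. a 3-way NLI head; purely illustrative
classification_head = spectral_norm(nn.Linear(1024, num_classes))
```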