Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models
Authors: Guande He, Jianfei Chen, Jun Zhu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on three natural language understanding tasks |
| Researcher Affiliation | Collaboration | 1Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua-Bosch Joint Center for ML, BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China |
| Pseudocode | Yes | Algorithm 1: Mask-Predict with Rejection Sampling |
| Open Source Code | Yes | we submit our codes as the supplementary material. |
| Open Datasets | Yes | SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) are ID and OD datasets for NLI; QQP (Shankar et al., 2017) and Twitter PPDB (Lan et al., 2017) are ID and OD datasets for PD; SWAG (Zellers et al., 2018) and HellaSWAG (Zellers et al., 2019) are ID and OD datasets for CR. We use the same train/validation/test split for those six datasets published by Desai & Durrett (2020). For each task, we fine-tune the model using the ID training set and evaluate the model's performance with both ID and OD test sets. We use WikiText-103 (Merity et al., 2016) as the corpus of the pre-training phase for JL-P. The detailed statistics for each dataset can be found in Appendix A.1. |
| Dataset Splits | Yes | We use the same train/validation/test split for those six datasets published by Desai & Durrett (2020). ... The statistics of the datasets for both MLM and NLU tasks are shown in Table 3. For the datasets of NLU tasks (SNLI/MNLI, QQP/Twitter PPDB, SWAG/HellaSWAG), we use the published version by Desai & Durrett (2020). ... Table 3: The size of the training, validation, and test splits and the number of labels for all datasets. |
| Hardware Specification | Yes | All experiments are run for 3 training epochs and are deployed on a single NVIDIA A40 48G GPU within 3 hours to fine-tune a single model. |
| Software Dependencies | No | The paper mentions software like the 'Huggingface Transformers library' (Wolf et al., 2020), 'OpenDelta library' (Ding et al., 2022), 'AdamW optimizer' (Loshchilov & Hutter, 2019), and 'Huggingface Datasets (Lhoest et al., 2021) library', but it does not specify explicit version numbers for these software components. |
| Experiment Setup | Yes | We conduct hyperparameter search for the mask probability p_mask, the scaling factor α_mlm of the MLM loss, and the regularization coefficient β_L2 on the contextualized representation. We also tune the hyperparameter σ_ls for label smoothing (LS). We search all the hyperparameters on the validation set of each task independently. Complete setup details for each method on each task can be found in Appendix A.2. ... we set a batch size of 32, a maximum sequence length of 256, and a weight decay of 0.1. |
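
As a companion to the Open Datasets row, the sketch below shows one way to pull public versions of the ID/OD dataset pairs from the Huggingface Hub. The Hub identifiers (`snli`, `multi_nli`, `glue`/`qqp`, `swag`, `hellaswag`) are assumptions for illustration only; the paper uses the train/validation/test splits published by Desai & Durrett (2020), and Twitter PPDB is distributed with that release rather than on the Hub.

```python
# Minimal sketch: fetching public versions of the ID/OD datasets.
# The actual experiments use the splits released by Desai & Durrett (2020),
# so these default Hub splits are only a stand-in.
from datasets import load_dataset

snli = load_dataset("snli")             # ID, natural language inference
mnli = load_dataset("multi_nli")        # OD, natural language inference
qqp = load_dataset("glue", "qqp")       # ID, paraphrase detection
swag = load_dataset("swag", "regular")  # ID, commonsense reasoning
hellaswag = load_dataset("hellaswag")   # OD, commonsense reasoning
# Twitter PPDB (OD, paraphrase detection) is not assumed to be on the Hub;
# it would be loaded from the files in Desai & Durrett (2020)'s release.
```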
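
The Experiment Setup row lists the searched hyperparameters (p_mask, α_mlm, β_L2, σ_ls) without spelling out the objective. The following is a minimal sketch of one plausible way these terms could combine during fine-tuning: a label-smoothed task loss, an auxiliary MLM loss scaled by α_mlm, and an L2 penalty on the contextualized representations weighted by β_L2. The exact form of the regularizer and the helper names (`joint_finetune_loss`, `pretrained_features`) are assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def joint_finetune_loss(cls_logits, labels, mlm_logits, mlm_labels,
                        features, pretrained_features,
                        alpha_mlm, beta_l2, sigma_ls):
    """Hypothetical combined objective using the searched hyperparameters."""
    # Classification loss with label smoothing sigma_ls.
    task_loss = F.cross_entropy(cls_logits, labels, label_smoothing=sigma_ls)
    # Auxiliary masked-language-modeling loss on tokens masked with
    # probability p_mask (masking is done by the data collator), scaled by alpha_mlm.
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)
    # One plausible L2 term on the contextualized representations: pull the
    # fine-tuned features toward the frozen pre-trained ones, weighted by beta_l2.
    feature_reg = ((features - pretrained_features) ** 2).mean()
    return task_loss + alpha_mlm * mlm_loss + beta_l2 * feature_reg
```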
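
The fixed settings quoted above (batch size 32, maximum sequence length 256, weight decay 0.1, 3 training epochs, AdamW) map naturally onto Huggingface `TrainingArguments`; the sketch below mirrors them. The output directory, the choice of backbone, and the tokenization fields are placeholders, and the learning rate is omitted because it is not quoted here.

```python
from transformers import AutoTokenizer, TrainingArguments

# Hypothetical reproduction of the reported fixed settings; learning rates
# and per-method hyperparameters come from the paper's Appendix A.2.
args = TrainingArguments(
    output_dir="ft-calibration",      # placeholder path
    per_device_train_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.1,                 # AdamW is the Trainer's default optimizer
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # backbone is an assumption
def encode(batch):
    # Field names shown here match SNLI/MNLI; other tasks use different columns.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=256)  # maximum sequence length 256
```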