Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models

Authors: Guande He, Jianfei Chen, Jun Zhu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We conduct experiments on three natural language understanding tasks" |
| Researcher Affiliation | Collaboration | "Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua-Bosch Joint Center for ML, BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China" |
| Pseudocode | Yes | "Algorithm 1: Mask-Predict with Rejection Sampling" |
| Open Source Code | Yes | "we submit our codes as the supplementary material." |
| Open Datasets | Yes | "SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) are ID and OD datasets for NLI; QQP (Shankar et al., 2017) and Twitter PPDB (Lan et al., 2017) are ID and OD datasets for PD; SWAG (Zellers et al., 2018) and HellaSwag (Zellers et al., 2019) are ID and OD datasets for CR. We use the same train/validation/test split for those six datasets published by Desai & Durrett (2020). For each task, we fine-tune the model using the ID training set and evaluate the model's performance with both ID and OD test sets. We use WikiText-103 (Merity et al., 2016) as the corpus of the pre-training phase for JL-P. The detailed statistics for each dataset can be found in Appendix A.1." (A dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | "We use the same train/validation/test split for those six datasets published by Desai & Durrett (2020). ... The statistics of the datasets for both MLM and NLU tasks are shown in Table 3. For the datasets of the NLU tasks (SNLI/MNLI, QQP/Twitter PPDB, SWAG/HellaSwag), we use the version published by Desai & Durrett (2020). ... Table 3: The size of the training, validation, and test splits and the number of labels for all datasets." |
| Hardware Specification | Yes | "All experiments are run for 3 training epochs and are deployed on a single NVIDIA A40 48G GPU within 3 hours to fine-tune a single model." |
| Software Dependencies | No | The paper mentions software such as the Hugging Face Transformers library (Wolf et al., 2020), the OpenDelta library (Ding et al., 2022), the AdamW optimizer (Loshchilov & Hutter, 2019), and the Hugging Face Datasets library (Lhoest et al., 2021), but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | "We conduct a hyperparameter search for the mask probability p_mask, the scaling factor α_mlm of the MLM loss, and the regularization coefficient β_L2 on the contextualized representation. We also tune the hyperparameter σ_ls for label smoothing (LS). We search all the hyperparameters on the validation set of each task independently. Complete setup details for each method on each task can be found in Appendix A.2. ... we set a batch size of 32, a maximum sequence length of 256, and a weight decay of 0.1." (A fine-tuning sketch follows the table.) |
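
The dataset pairs quoted in the Open Datasets row are all public. As a minimal sketch (not the authors' released code), the snippet below loads Hub-hosted versions of most of them with the Hugging Face Datasets library named in the Software Dependencies row. The Hub dataset identifiers are assumptions, Twitter PPDB is assumed not to be on the Hub, and the paper itself uses the train/validation/test splits published by Desai & Durrett (2020), which may differ from these Hub copies.

```python
# Minimal sketch: loading public Hub versions of the datasets named in the paper.
# The exact splits used in the paper come from Desai & Durrett (2020) and may differ.
from datasets import load_dataset

# NLI: SNLI (in-domain) and MNLI (out-of-domain)
snli = load_dataset("snli")
mnli = load_dataset("glue", "mnli")

# PD: QQP (in-domain); Twitter PPDB must be obtained separately
qqp = load_dataset("glue", "qqp")

# CR: SWAG (in-domain) and HellaSwag (out-of-domain)
swag = load_dataset("swag", "regular")
hellaswag = load_dataset("hellaswag")

# Pre-training corpus for JL-P: WikiText-103
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

# Print split sizes as a quick sanity check against the paper's Table 3.
for name, ds in [("snli", snli), ("mnli", mnli), ("qqp", qqp),
                 ("swag", swag), ("hellaswag", hellaswag)]:
    print(name, {split: ds[split].num_rows for split in ds})
```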
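The Experiment Setup and Hardware Specification rows pin down only a handful of training hyperparameters: batch size 32, maximum sequence length 256, weight decay 0.1, 3 epochs, and the AdamW optimizer. The sketch below wires those values into a plain Hugging Face Trainer fine-tuning run on SNLI; it is an illustrative assumption rather than the authors' script. The roberta-base backbone and the learning rate are not specified in the table, and the method-specific hyperparameters found by search (p_mask, α_mlm, β_L2, σ_ls) are not reproduced here.

```python
# Minimal sketch, assuming a roberta-base backbone and a standard Trainer loop.
# Only the quoted hyperparameters (batch size 32, max length 256, weight decay 0.1,
# 3 epochs, AdamW) come from the paper; everything else is illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # assumption: the table does not fix the backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def preprocess(batch):
    # Premise/hypothesis pairs, truncated to the 256-token maximum from the paper
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=256)

snli = load_dataset("snli").filter(lambda ex: ex["label"] != -1)  # drop unlabeled pairs
encoded = snli.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="snli-finetune",
    per_device_train_batch_size=32,  # batch size 32 (quoted above)
    num_train_epochs=3,              # 3 training epochs (quoted above)
    weight_decay=0.1,                # weight decay 0.1 (quoted above)
    learning_rate=2e-5,              # assumption: not stated in the table
)

# Trainer's default optimizer is AdamW, matching the optimizer the paper cites.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
trainer.evaluate()
```

Note that this is vanilla fine-tuning only; the paper's calibration-oriented additions (joint MLM loss, representation regularization, label smoothing) would sit on top of this loop with the searched hyperparameters from Appendix A.2.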