Preserving Pre-trained Features Helps Calibrate Fine-tuned Language Models

Authors: Guande He, Jianfei Chen, Jun Zhu

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We conduct experiments on three natural language understanding tasks" |
| Researcher Affiliation | Collaboration | "Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua-Bosch Joint Center for ML, BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China" |
| Pseudocode | Yes | "Algorithm 1: Mask-Predict with Rejection Sampling" |
| Open Source Code | Yes | "we submit our codes as the supplementary material." |
| Open Datasets | Yes | "SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) are ID and OD datasets for NLI; QQP (Shankar et al., 2017) and Twitter PPDB (Lan et al., 2017) are ID and OD datasets for PD; SWAG (Zellers et al., 2018) and HellaSwag (Zellers et al., 2019) are ID and OD datasets for CR. We use the same train/validation/test split for those six datasets published by Desai & Durrett (2020). For each task, we fine-tune the model using the ID training set and evaluate the model's performance with both ID and OD test sets. We use WikiText-103 (Merity et al., 2016) as the corpus of the pre-training phase for JL-P. The detailed statistics for each dataset can be found in Appendix A.1." (A dataset-loading sketch follows the table.) |
| Dataset Splits | Yes | "We use the same train/validation/test split for those six datasets published by Desai & Durrett (2020). ... The statistics of the datasets for both MLM and NLU tasks are shown in Table 3. For the datasets of the NLU tasks (SNLI/MNLI, QQP/Twitter PPDB, SWAG/HellaSwag), we use the version published by Desai & Durrett (2020). ... Table 3: The size of the training, validation, and test splits and the number of labels for all datasets." |
| Hardware Specification | Yes | "All experiments are run for 3 training epochs and are deployed on a single NVIDIA A40 48G GPU within 3 hours to fine-tune a single model." |
| Software Dependencies | No | The paper mentions software such as the Hugging Face Transformers library (Wolf et al., 2020), the OpenDelta library (Ding et al., 2022), the AdamW optimizer (Loshchilov & Hutter, 2019), and the Hugging Face Datasets library (Lhoest et al., 2021), but it does not specify version numbers for these software components. |
| Experiment Setup | Yes | "We conduct a hyperparameter search for the mask probability p_mask, the scaling factor α_mlm of the MLM loss, and the regularization coefficient β_L2 on the contextualized representation. We also tune the hyperparameter σ_ls for label smoothing (LS). We search all the hyperparameters on the validation set of each task independently. Complete setup details for each method on each task can be found in Appendix A.2. ... we set a batch size of 32, a maximum sequence length of 256, and a weight decay of 0.1." (A fine-tuning sketch follows the table.) |
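
The dataset pairs quoted in the Open Datasets row are all public. As a minimal sketch (not the authors' released code), the snippet below loads Hub-hosted versions of most of them with the Hugging Face Datasets library named in the Software Dependencies row. The Hub dataset identifiers are assumptions, Twitter PPDB is assumed not to be on the Hub, and the paper itself uses the train/validation/test splits published by Desai & Durrett (2020), which may differ from these Hub copies.

```python
# Minimal sketch: loading public Hub versions of the datasets named in the paper.
# The exact splits used in the paper come from Desai & Durrett (2020) and may differ.
from datasets import load_dataset

# NLI: SNLI (in-domain) and MNLI (out-of-domain)
snli = load_dataset("snli")
mnli = load_dataset("glue", "mnli")

# PD: QQP (in-domain); Twitter PPDB must be obtained separately
qqp = load_dataset("glue", "qqp")

# CR: SWAG (in-domain) and HellaSwag (out-of-domain)
swag = load_dataset("swag", "regular")
hellaswag = load_dataset("hellaswag")

# Pre-training corpus for JL-P: WikiText-103
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

# Print split sizes as a quick sanity check against the paper's Table 3.
for name, ds in [("snli", snli), ("mnli", mnli), ("qqp", qqp),
                 ("swag", swag), ("hellaswag", hellaswag)]:
    print(name, {split: ds[split].num_rows for split in ds})
```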
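The Experiment Setup and Hardware Specification rows pin down only a handful of training hyperparameters: batch size 32, maximum sequence length 256, weight decay 0.1, 3 epochs, and the AdamW optimizer. The sketch below wires those values into a plain Hugging Face Trainer fine-tuning run on SNLI; it is an illustrative assumption rather than the authors' script. The roberta-base backbone and the learning rate are not specified in the table, and the method-specific hyperparameters found by search (p_mask, α_mlm, β_L2, σ_ls) are not reproduced here.

```python
# Minimal sketch, assuming a roberta-base backbone and a standard Trainer loop.
# Only the quoted hyperparameters (batch size 32, max length 256, weight decay 0.1,
# 3 epochs, AdamW) come from the paper; everything else is illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"  # assumption: the table does not fix the backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def preprocess(batch):
    # Premise/hypothesis pairs, truncated to the 256-token maximum from the paper
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=256)

snli = load_dataset("snli").filter(lambda ex: ex["label"] != -1)  # drop unlabeled pairs
encoded = snli.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="snli-finetune",
    per_device_train_batch_size=32,  # batch size 32 (quoted above)
    num_train_epochs=3,              # 3 training epochs (quoted above)
    weight_decay=0.1,                # weight decay 0.1 (quoted above)
    learning_rate=2e-5,              # assumption: not stated in the table
)

# Trainer's default optimizer is AdamW, matching the optimizer the paper cites.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
trainer.evaluate()
```

Note that this is vanilla fine-tuning only; the paper's calibration-oriented additions (joint MLM loss, representation regularization, label smoothing) would sit on top of this loop with the searched hyperparameters from Appendix A.2.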