Transferable Post-hoc Calibration on Pretrained Transformers in Noisy Text Classification

Authors: Jun Zhang, Wen Yao, Xiaoqian Chen, Ling Feng

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the performance of the proposed methods under noisy settings, we construct a benchmark consisting of four noise types and five shift intensities based on the QNLI, AG-News, and Emotion tasks. Experimental results on the noisy benchmark show that (1) the metrics are effective in measuring distribution shift and (2) transferable TS can significantly decrease the expected calibration error (ECE) compared with the competitive baseline ensemble TS by approximately 46.09%.
Researcher Affiliation | Academia | Jun Zhang (1,2*), Wen Yao (2*), Xiaoqian Chen (2), Ling Feng (1); 1 Tsinghua University; 2 National Innovation Institute of Defense Technology, Chinese Academy of Military Science. jun-zhan19@mails.tsinghua.edu.cn, wendy0782@126.com, chenxiaoqian@nudt.edu.cn, fengling@tsinghua.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks (e.g., a figure or section explicitly labeled 'Pseudocode' or 'Algorithm').
Open Source Code | No | The paper provides a link to the baseline ETS method's code (https://github.com/zhang64-llnl/Mix-n-Match-Calibration) but does not provide concrete access to the source code for the authors' own proposed transferable TS methodology.
Open Datasets | Yes | We choose three text classification tasks: QNLI (Wang et al. 2018a; https://huggingface.co/datasets/glue), AG-News (Zhang, Zhao, and LeCun 2015; https://huggingface.co/datasets/ag_news), and Emotion (Saravia et al. 2018; https://huggingface.co/datasets/emotion). The dataset splits are shown in Table 2.
Dataset Splits | Yes | The dataset splits are shown in Table 2. We use the original validation set with true labels as our test set; the benchmark's new validation set is sampled from the original training set and has the same size as the test set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using BERT and RoBERTa models and the TextAttack framework, but it does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | The dropout rate is set to 0.2, and the sample size of MCD (Monte Carlo Dropout) is 5. The number of bins in ECE is set to 10. We generate five levels of noisy datasets from the clean texts by increasing the noise proportion with the CharSwap method in the TextAttack framework.
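The 10-bin expected calibration error reported above can be sketched as follows. This is a minimal equal-width-binning implementation written for illustration; the paper's exact binning and tie-handling details are assumptions here.

```python
def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE: the bin-size-weighted average of |accuracy - mean confidence|
    over equal-width confidence bins (the paper sets n_bins = 10)."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins (lo, hi]; the first bin also admits confidence == 0.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c == lo)]
        if not in_bin:
            continue
        acc = sum(predictions[i] == labels[i] for i in in_bin) / len(in_bin)
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        ece += len(in_bin) / n * abs(acc - avg_conf)
    return ece
```

For example, four predictions all made with confidence 0.95 but only three of them correct fall in the (0.9, 1.0] bin with accuracy 0.75, giving an ECE of |0.75 - 0.95| = 0.2.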
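The noise injection step relies on TextAttack's CharSwap augmentation. A stdlib-only stand-in for that idea is sketched below: a hypothetical `char_swap` helper that transposes adjacent characters inside a proportion of the words, with higher proportions giving the stronger shift intensities described in the paper. It is illustrative only and does not reproduce TextAttack's actual transformation rules.

```python
import random

def char_swap(text, proportion, seed=0):
    """Corrupt roughly `proportion` of the words by swapping one pair of
    adjacent internal characters (illustrative stand-in for CharSwap)."""
    rng = random.Random(seed)
    words = text.split()
    n_noisy = max(1, round(proportion * len(words)))
    for i in rng.sample(range(len(words)), min(n_noisy, len(words))):
        w = words[i]
        if len(w) > 3:  # leave very short words untouched
            j = rng.randrange(1, len(w) - 2)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)
```

Sweeping `proportion` over five increasing values would yield the five shift levels per noise type; the swap preserves each word's character multiset, so only spelling, not vocabulary, is perturbed.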