Prototypical Calibration for Few-shot Learning of Language Models
Authors: Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, Furu Wei
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance. |
| Researcher Affiliation | Industry | Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, Furu Wei; Microsoft Research; {zhixhan8,sunyutao20001121}@gmail.com, {yaruhao,lidong1,fuwei}@microsoft.com |
| Pseudocode | No | The paper describes the proposed method, including equations and a step-by-step explanation (e.g., "PROTOTYPICAL CLUSTER ESTIMATION", "CLUSTER-LABEL ASSIGNMENT", "INFERENCE"), but it does not include a formally labeled "Pseudocode" or "Algorithm" block; hedged code sketches of these stages follow the table. |
| Open Source Code | Yes | The code will be released at https://github.com/zhixhan/ProCa. |
| Open Datasets | Yes | We evaluate the proposed method on nine widely-used text-classification datasets including SST-2 (Socher et al., 2013), SST-5 (Socher et al., 2013), Subj (Pang & Lee, 2004), MR (Pang & Lee, 2005), AP (Zhang et al., 2015), DBPedia (Zhang et al., 2015), AGNews (Zhang et al., 2015), RTE (Dagan et al., 2005), and TREC (Voorhees & Tice, 2000). |
| Dataset Splits | Yes | We use the full validation set for evaluation except for AGNews, DBPedia, and AP, for which we randomly sample 2000 test examples. We compute the average accuracy on the validation set over five random seeds for each setting, except for Bloom, for which we use 2 seeds. |
| Hardware Specification | Yes | We conduct the evaluation on 8 Tesla A100 GPUs for Bloom and Tesla V100 GPUs for other models. |
| Software Dependencies | No | The paper mentions using "GPT-family" models (GPT-2-large, GPT-2-XL, GPT-neo, GPT-J, Bloom) but does not specify software dependencies such as Python versions, deep learning frameworks (e.g., PyTorch, TensorFlow), or specific library versions with numerical identifiers. |
| Experiment Setup | Yes | We use the k-means algorithm to initialize GMM parameters to accelerate the convergence. The maximum iterations and the convergence threshold for each EM process are set to 100 and 1e-3 respectively. Moreover, we repeat the estimation multiple times with different random seeds. |
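
The quoted experiment setup pins down the cluster-estimation stage concretely: a Gaussian mixture model initialized with k-means, EM capped at 100 iterations with a 1e-3 convergence threshold, and the estimation repeated over several random seeds. A minimal sketch of that setup follows, using scikit-learn (the paper does not name its implementation). Here `lm_probs` stands for the LM's label-probability vectors on an unlabeled estimate set, and keeping the highest-likelihood fit across seeds is an assumption; the quoted setup only says the estimation is repeated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_prototypes(lm_probs: np.ndarray, n_labels: int, seeds=range(5)):
    """Fit an n_labels-component GMM to the LM's label-probability vectors
    (shape (N, n_labels)), using the hyperparameters quoted in the paper:
    k-means initialization, max 100 EM iterations, 1e-3 convergence threshold.
    Repeating over seeds and keeping the best fit by likelihood is an
    assumption about how the repeats are combined."""
    best_gmm, best_ll = None, -np.inf
    for seed in seeds:
        gmm = GaussianMixture(
            n_components=n_labels,
            init_params="kmeans",  # k-means initialization of GMM parameters
            max_iter=100,          # maximum EM iterations
            tol=1e-3,              # EM convergence threshold
            random_state=seed,
        )
        gmm.fit(lm_probs)
        if gmm.lower_bound_ > best_ll:  # log-likelihood lower bound of the fit
            best_gmm, best_ll = gmm, gmm.lower_bound_
    return best_gmm
```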
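The Pseudocode row notes that the paper describes "CLUSTER-LABEL ASSIGNMENT" and "INFERENCE" in prose rather than as an algorithm block. The sketch below is one plausible realization of those two stages on top of the GMM fit from the previous sketch: a one-to-one cluster-to-label mapping obtained with the Hungarian algorithm on the prototype means, then posterior-argmax inference. The assignment rule here is an assumption, not a confirmed detail of the authors' procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_clusters_to_labels(gmm) -> np.ndarray:
    """Map each GMM cluster to a label one-to-one by maximizing the prototype
    (mean) mass on the assigned label's dimension (hypothetical rule)."""
    # gmm.means_ has shape (K, K): K cluster prototypes over K label dims.
    cluster_idx, label_idx = linear_sum_assignment(-gmm.means_)
    mapping = np.empty(len(cluster_idx), dtype=int)
    mapping[cluster_idx] = label_idx
    return mapping  # mapping[c] = label assigned to cluster c

def predict(gmm, mapping: np.ndarray, lm_probs: np.ndarray) -> np.ndarray:
    """Classify test examples by their most probable cluster under the GMM,
    then translate clusters to labels with the learned mapping."""
    clusters = gmm.predict(lm_probs)  # posterior-argmax cluster per example
    return mapping[clusters]
```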