Prototypical Calibration for Few-shot Learning of Language Models

Authors: Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, Furu Wei

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance.
Researcher Affiliation | Industry | Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, Furu Wei; Microsoft Research; {zhixhan8,sunyutao20001121}@gmail.com, {yaruhao,lidong1,fuwei}@microsoft.com
Pseudocode | No | The paper describes the proposed method, including equations and a step-by-step explanation (e.g., "PROTOTYPICAL CLUSTER ESTIMATION", "CLUSTER-LABEL ASSIGNMENT", "INFERENCE"), but it does not include a formally labeled "Pseudocode" or "Algorithm" block. A minimal sketch of these three stages is given after the table.
Open Source Code | Yes | The code will be released at https://github.com/zhixhan/ProCa.
Open Datasets | Yes | We evaluate the proposed method on nine widely-used text-classification datasets including SST-2 (Socher et al., 2013), SST-5 (Socher et al., 2013), Subj (Pang & Lee, 2004), MR (Pang & Lee, 2005), AP (Zhang et al., 2015), DBPedia (Zhang et al., 2015), AGNews (Zhang et al., 2015), RTE (Dagan et al., 2005), and TREC (Voorhees & Tice, 2000).
Dataset Splits | Yes | We use the full validation set for evaluation except for AGNews, DBPedia, and AP, for which we randomly sample 2000 test examples. We compute the average accuracy on the validation set over five random seeds for each setting, except for Bloom, for which we use 2 seeds. A short sketch of this evaluation protocol follows the table.
Hardware Specification | Yes | We conduct the evaluation on 8 Tesla A100 GPUs for Bloom and Tesla V100 GPUs for other models.
Software Dependencies | No | The paper mentions using "GPT-family" models (GPT-2-large, GPT-2-XL, GPT-Neo, GPT-J, Bloom) but does not specify software dependencies such as Python versions, deep learning frameworks (e.g., PyTorch, TensorFlow), or specific library versions.
Experiment Setup | Yes | We use the k-means algorithm to initialize the GMM parameters to accelerate convergence. The maximum number of iterations and the convergence threshold for each EM process are set to 100 and 1e-3, respectively. Moreover, we repeat the estimation multiple times with different random seeds. A sketch of this EM configuration also follows the table.
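
The three stages named in the Pseudocode row can be illustrated with a minimal Python sketch. This is not the authors' implementation: `estimate_probs` and `test_probs` are assumed to be matrices of the language model's label-word probabilities (one row per example, one column per label), and the cluster-label assignment via maximum-weight matching on the cluster means is one plausible reading of the paper's assignment step.

```python
# Hypothetical sketch of prototypical calibration's three stages:
# cluster estimation, cluster-label assignment, and inference.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.mixture import GaussianMixture

def prototypical_calibration(estimate_probs, test_probs, num_labels, seed=0):
    # 1) Prototypical cluster estimation: fit a GMM with one component per
    #    label on LM probability vectors from an unlabeled estimate set.
    gmm = GaussianMixture(n_components=num_labels, random_state=seed)
    gmm.fit(estimate_probs)

    # 2) Cluster-label assignment: one-to-one matching between clusters and
    #    labels, scoring each pair by the cluster mean's probability mass on
    #    that label dimension (an assumption, not a confirmed detail).
    cluster_idx, label_idx = linear_sum_assignment(-gmm.means_)
    cluster_to_label = dict(zip(cluster_idx, label_idx))

    # 3) Inference: assign each test example to its most probable cluster
    #    and predict the label mapped to that cluster.
    clusters = gmm.predict(test_probs)
    return np.array([cluster_to_label[c] for c in clusters])
```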
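
For the Dataset Splits row, a minimal sketch of the quoted protocol, subsampling 2000 test examples for the large datasets and averaging accuracy over several seeds, might look as follows; `predict_fn` and the dataset format are placeholders, not details from the paper.

```python
# Hypothetical evaluation loop: subsample large validation sets to 2000
# examples and average accuracy over multiple random seeds.
import numpy as np

def evaluate(texts, labels, predict_fn, num_seeds=5, max_examples=2000):
    labels = np.asarray(labels)
    accuracies = []
    for seed in range(num_seeds):
        rng = np.random.default_rng(seed)
        idx = np.arange(len(texts))
        if len(idx) > max_examples:
            idx = rng.choice(idx, size=max_examples, replace=False)
        preds = np.array([predict_fn(texts[i], seed=seed) for i in idx])
        accuracies.append(np.mean(preds == labels[idx]))
    return float(np.mean(accuracies))
```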
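
Finally, the EM configuration quoted in the Experiment Setup row (k-means initialization, at most 100 iterations, a 1e-3 convergence threshold, repeated runs over several seeds) maps directly onto scikit-learn's GaussianMixture. Keeping the fit with the highest log-likelihood across seeds is an assumption, since the quote does not state how the repeated estimates are combined.

```python
# Sketch of the quoted GMM/EM settings; the best-of-N-seeds selection rule
# is an assumption rather than a detail confirmed by the paper's quote.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_multi_seed(estimate_probs, num_labels, num_seeds=5):
    best_gmm, best_score = None, -np.inf
    for seed in range(num_seeds):
        gmm = GaussianMixture(
            n_components=num_labels,
            init_params="kmeans",  # k-means initialization to speed up convergence
            max_iter=100,          # maximum EM iterations
            tol=1e-3,              # convergence threshold
            random_state=seed,
        )
        gmm.fit(estimate_probs)
        score = gmm.score(estimate_probs)  # mean per-sample log-likelihood
        if score > best_score:
            best_gmm, best_score = gmm, score
    return best_gmm
```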