Prototypical Calibration for Few-shot Learning of Language Models
Authors: Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, Furu Wei
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that prototypical calibration yields a substantial improvement on a diverse set of tasks. Extensive analysis across different scales also indicates that our method calibrates the decision boundary as expected, greatly improving the robustness of GPT to templates, permutations, and class imbalance. |
| Researcher Affiliation | Industry | Zhixiong Han, Yaru Hao, Li Dong, Yutao Sun, Furu Wei; Microsoft Research; {zhixhan8,sunyutao20001121}@gmail.com, {yaruhao,lidong1,fuwei}@microsoft.com |
| Pseudocode | No | The paper describes the proposed method, including equations and a step-by-step explanation (e.g., "PROTOTYPICAL CLUSTER ESTIMATION", "CLUSTER-LABEL ASSIGNMENT", "INFERENCE"), but it does not include a formally labeled "Pseudocode" or "Algorithm" block; hedged code sketches of these stages follow the table. |
| Open Source Code | Yes | The code will be released at https://github.com/zhixhan/ProCa. |
| Open Datasets | Yes | We evaluate the proposed method on nine widely-used text-classification datasets including SST-2 (Socher et al., 2013), SST-5 (Socher et al., 2013), Subj (Pang & Lee, 2004), MR (Pang & Lee, 2005), AP (Zhang et al., 2015), DBPedia (Zhang et al., 2015), AGNews (Zhang et al., 2015), RTE (Dagan et al., 2005), and TREC (Voorhees & Tice, 2000). |
| Dataset Splits | Yes | We use the full validation set for evaluation except for AGNews, DBPedia, and AP, for which we randomly sample 2000 test examples. We compute the average accuracy on the validation set over five random seeds for each setting, except for Bloom, for which we use 2 seeds. |
| Hardware Specification | Yes | We conduct the evaluation on 8 Tesla A100 GPUs for Bloom and Tesla V100 GPUs for other models. |
| Software Dependencies | No | The paper mentions using "GPT-family" models (GPT-2-large, GPT-2-XL, GPT-neo, GPT-J, Bloom) but does not specify software dependencies such as Python versions, deep learning frameworks (e.g., PyTorch, TensorFlow), or specific library versions with numerical identifiers. |
| Experiment Setup | Yes | We use the k-means algorithm to initialize GMM parameters to accelerate the convergence. The maximum iterations and the convergence threshold for each EM process are set to 100 and 1e-3 respectively. Moreover, we repeat the estimation multiple times with different random seeds. |
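
The quoted experiment setup pins down the cluster-estimation stage concretely: a Gaussian mixture model initialized with k-means, EM capped at 100 iterations with a 1e-3 convergence threshold, and the estimation repeated over several random seeds. A minimal sketch of that setup follows, using scikit-learn (the paper does not name its implementation). Here `lm_probs` stands for the LM's label-probability vectors on an unlabeled estimate set, and keeping the highest-likelihood fit across seeds is an assumption; the quoted setup only says the estimation is repeated.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_prototypes(lm_probs: np.ndarray, n_labels: int, seeds=range(5)):
    """Fit an n_labels-component GMM to the LM's label-probability vectors
    (shape (N, n_labels)), using the hyperparameters quoted in the paper:
    k-means initialization, max 100 EM iterations, 1e-3 convergence threshold.
    Repeating over seeds and keeping the best fit by likelihood is an
    assumption about how the repeats are combined."""
    best_gmm, best_ll = None, -np.inf
    for seed in seeds:
        gmm = GaussianMixture(
            n_components=n_labels,
            init_params="kmeans",  # k-means initialization of GMM parameters
            max_iter=100,          # maximum EM iterations
            tol=1e-3,              # EM convergence threshold
            random_state=seed,
        )
        gmm.fit(lm_probs)
        if gmm.lower_bound_ > best_ll:  # log-likelihood lower bound of the fit
            best_gmm, best_ll = gmm, gmm.lower_bound_
    return best_gmm
```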
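The Pseudocode row notes that the paper describes "CLUSTER-LABEL ASSIGNMENT" and "INFERENCE" in prose rather than as an algorithm block. The sketch below is one plausible realization of those two stages on top of the GMM fit from the previous sketch: a one-to-one cluster-to-label mapping obtained with the Hungarian algorithm on the prototype means, then posterior-argmax inference. The assignment rule here is an assumption, not a confirmed detail of the authors' procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_clusters_to_labels(gmm) -> np.ndarray:
    """Map each GMM cluster to a label one-to-one by maximizing the prototype
    (mean) mass on the assigned label's dimension (hypothetical rule)."""
    # gmm.means_ has shape (K, K): K cluster prototypes over K label dims.
    cluster_idx, label_idx = linear_sum_assignment(-gmm.means_)
    mapping = np.empty(len(cluster_idx), dtype=int)
    mapping[cluster_idx] = label_idx
    return mapping  # mapping[c] = label assigned to cluster c

def predict(gmm, mapping: np.ndarray, lm_probs: np.ndarray) -> np.ndarray:
    """Classify test examples by their most probable cluster under the GMM,
    then translate clusters to labels with the learned mapping."""
    clusters = gmm.predict(lm_probs)  # posterior-argmax cluster per example
    return mapping[clusters]
```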