Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering
Authors: Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine A Heller, Subhrajit Roy
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We conducted extensive experiments on more than 10 natural language understanding tasks together with image classification tasks. |
| Researcher Affiliation | Collaboration | Han Zhou¹,², Xingchen Wan¹, Lev Proleev¹, Diana Mincu¹, Jilin Chen¹, Katherine Heller¹, Subhrajit Roy¹; ¹Google Research, ²University of Cambridge |
| Pseudocode | No | The paper describes methods using natural language and mathematical equations (e.g., Eq. 1, Eq. 2, Eq. 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | For natural language tasks... we conduct experiments on 13 more diverse and challenging classification tasks, including the standard GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) datasets. Specifically, we consider commonsense reasoning: BoolQ (Clark et al., 2019), COPA (Roemmele et al., 2011); word disambiguation: WiC (Pilehvar & Camacho-Collados, 2019); sentiment classification: SST-2 (Socher et al., 2013); paraphrasing: QQP, MRPC (Dolan & Brockett, 2005); natural language inference and entailment: ANLI-R{1,2,3} (Nie et al., 2020), CB (De Marneffe et al., 2019), RTE, QNLI (QA/NLI), MNLI (Williams et al., 2018). For image classification tasks, we include SVHN (Yuval, 2011), EuroSAT (Helber et al., 2019), and CLEVR (Johnson et al., 2017). These are all publicly available datasets and are cited appropriately. |
| Dataset Splits | Yes | |Test| denotes the number of test samples, where we consistently use the validation split as the test split because labels are not publicly available for some datasets. |
| Hardware Specification | No | The paper mentions the use of models like PaLM 2 (PaLM 2-S, PaLM 2-M, and PaLM 2-L) and CLIP ViT-B/16 but does not specify the underlying hardware (e.g., specific GPUs, CPUs, or TPU versions) used for training or inference. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python version, PyTorch/TensorFlow version, specific library versions) that would be needed to reproduce the experiments. |
| Experiment Setup | Yes | To select the appropriate γ, we simply perform a grid search by uniformly sampling T different γ values in [a, b] (we set [a, b] := [−5, 5], but any reasonable range may be used). In the n-shot ICL experiments reported in Table 2 and Fig. 6, k-shot ICL concatenates k random training samples per class. We report the mean and standard deviation of all results over 5 different choices of in-context examples. For a fair comparison, we use the same test set as the unlabeled estimate set for PC. We follow the same hyper-parameters reported by PC, with 100 maximum iterations for EM and 100 random initializations for the whole learning process to stabilize its estimation. Setup details are listed in Appendix F. |
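
The setup quoted above centers on selecting the calibration strength γ by grid search over a small labeled set. Below is a minimal sketch of that procedure, assuming Batch Calibration scores each example by subtracting a γ-weighted estimate of the contextual bias (taken here as the mean class score over the batch); the function names, the choice of T, and the use of accuracy as the selection criterion are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def batch_calibration_scores(log_probs, gamma=1.0):
    """Adjust per-example class scores by a gamma-weighted estimate of the
    contextual bias, taken here as the mean score over the batch.

    log_probs: array of shape (num_examples, num_classes) holding the LLM's
    (log-)probabilities for each class label given the prompt.
    """
    bias = log_probs.mean(axis=0, keepdims=True)  # contextual-bias estimate
    return log_probs - gamma * bias               # calibrated scores

def select_gamma(dev_log_probs, dev_labels, a=-5.0, b=5.0, T=21):
    """Grid search for gamma on a labeled few-shot set, uniformly sampling
    T values in [a, b] as described in the experiment setup (T is assumed)."""
    best_gamma, best_acc = None, -1.0
    for gamma in np.linspace(a, b, T):
        preds = batch_calibration_scores(dev_log_probs, gamma).argmax(axis=1)
        acc = (preds == dev_labels).mean()
        if acc > best_acc:
            best_gamma, best_acc = gamma, acc
    return best_gamma
```

With γ = 1 this reduces to the label-free adjustment used in the zero-shot setting; the grid search only applies to the few-shot variant, where labeled examples are available to tune the strength of the correction.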