$k$NN Prompting: Beyond-Context Learning with Calibration-Free Nearest Neighbor Inference
Authors: Benfeng Xu, Quan Wang, Zhendong Mao, Yajuan Lyu, Qiaoqiao She, Yongdong Zhang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments to demonstrate its two-fold superiority: 1) Calibration-Free: kNN Prompting does not directly align LLM output distribution with task-specific label space, instead leverages such distribution to align test and training instances. It significantly outperforms state-of-the-art calibration-based methods under comparable few-shot scenario. 2) Beyond-Context: kNN Prompting can further scale up effectively with as many training data as are available, continually bringing substantial improvements. |
| Researcher Affiliation | Collaboration | (1) University of Science and Technology of China, Hefei, China; (2) Beijing University of Posts and Telecommunications, Beijing, China; (3) Baidu Inc., Beijing, China; (4) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China |
| Pseudocode | No | The paper describes the kNN Prompting framework in Section 3 with textual explanations and mathematical equations, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is publicly available at https://github.com/BenfengXu/KNNPrompting |
| Open Datasets | Yes | We use 10 established text classification datasets, respectively SST2 (Socher et al., 2013), SUBJ (Pang & Lee, 2004), MPQA (Wiebe et al., 2005), AGNews (Zhang et al., 2015), CB (De Marneffe et al., 2019), CR (Hu & Liu, 2004), DBPedia (Zhang et al., 2015), MR (Pang & Lee, 2005), RTE (Dagan et al., 2005) and TREC (Voorhees & Tice, 2000). |
| Dataset Splits | No | The paper describes a training set T that is split into a demonstration set D and an anchor set A for the kNN Prompting method, and discusses a test instance xtest. It also refers to 'Num. of Shots' (training data size). However, it does not state conventional training/validation/test splits (e.g., an 80/10/10 split) or mention a dedicated validation set for hyperparameter tuning; the term 'validation' is not used in the context of data splits. |
| Hardware Specification | No | The paper mentions the use of various LLMs (e.g., GPT2, OPT series) with different parameter scales (0.8B to 30B), but it does not provide any specific hardware specifications such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'GPT2 tokenizer' and refers to models like 'GPT2' and 'OPT series', but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, specific library versions). |
| Experiment Setup | Yes | We invariantly set the number of neighbors k to 3. There are no other hyper-parameters as the entire framework is training-free. [...] We set learning rate to 1e-5, batch size to 16, and training steps to 125, 250 or 500, respectively for m ∈ {32, 64}, {128, 256}, {512, 1024}. For CB, AGNews and RTE, batch size is adjusted to 8, for DBPedia, batch size is adjusted to 4 to avoid OOM. |
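
The inference procedure summarized in the rows above (a demonstration set D used as in-context examples, an anchor set A of labeled training instances, and k = 3 nearest neighbors) can be sketched compactly. The snippet below is a minimal, illustrative reconstruction under a HuggingFace causal LM, not the authors' released implementation: the prompt template, the toy sentiment data, the KL-divergence direction, and the majority-vote aggregation are assumptions, and helper names such as `build_prompt` and `next_token_log_probs` are hypothetical.

```python
# Illustrative sketch of kNN Prompting inference (not the official code).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # the paper evaluates GPT-2 and OPT models of various sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def next_token_log_probs(prompt: str) -> torch.Tensor:
    """Log-probability distribution over the full vocabulary for the token following `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # last position, shape [vocab_size]
    return F.log_softmax(logits, dim=-1)

def build_prompt(demonstrations, x):
    """Concatenate in-context demonstrations with the query instance (template is illustrative)."""
    lines = [f"Review: {d_x}\nSentiment: {d_y}" for d_x, d_y in demonstrations]
    lines.append(f"Review: {x}\nSentiment:")
    return "\n\n".join(lines)

def knn_prompting_predict(demonstrations, anchors, x_test, k=3):
    """Retrieve the k anchors whose LLM output distributions are closest to the test
    instance's distribution (KL divergence), then aggregate their labels."""
    log_p_test = next_token_log_probs(build_prompt(demonstrations, x_test))
    scored = []
    for a_x, a_y in anchors:
        log_p_anchor = next_token_log_probs(build_prompt(demonstrations, a_x))
        # KL(p_test || p_anchor); the direction is one reasonable choice, not a claim about the paper
        kl = torch.sum(log_p_test.exp() * (log_p_test - log_p_anchor)).item()
        scored.append((kl, a_y))
    neighbors = [y for _, y in sorted(scored, key=lambda t: t[0])[:k]]
    return max(set(neighbors), key=neighbors.count)  # majority vote among k neighbors

# Toy usage (texts and labels are made up):
demos = [("a gripping, beautifully shot film", "positive"),
         ("tedious and utterly forgettable", "negative")]
anchors = [("one of the year's best surprises", "positive"),
           ("a dull, lifeless mess", "negative"),
           ("charming from start to finish", "positive")]
print(knn_prompting_predict(demos, anchors, "an absolute delight to watch"))
```

Note how this sketch reflects the paper's "Beyond-Context" claim: the demonstration context stays within the LLM's window, while the anchor set can grow with however much training data is available, at the cost of one additional forward pass per anchor.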