Probing the Decision Boundaries of In-context Learning in Large Language Models

Authors: Siyan Zhao, Tung Nguyen, Aditya Grover

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we propose a new mechanism to probe and understand in-context learning from the lens of decision boundaries for in-context binary classification. ... This paper investigates the factors influencing these decision boundaries and explores methods to enhance their generalizability. We assess various approaches, including training-free and fine-tuning methods for LLMs, the impact of model architecture, and the effectiveness of active prompting techniques for smoothing decision boundaries in a data-efficient manner."
Researcher Affiliation | Academia | "Siyan Zhao, Tung Nguyen, Aditya Grover, Department of Computer Science, University of California, Los Angeles, {siyanz,tungnd,adityag}@cs.ucla.edu"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our code is released at https://github.com/siyan-zhao/ICL_decision_boundary."
Open Datasets | Yes | "We generate classification datasets using scikit-learn [Pedregosa et al., 2011], creating three types of linear and non-linear classification tasks: linear, circle, and moon, each describing different shapes of ground-truth decision boundaries. Detailed information on the dataset generation can be found in Appendix G."
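The three task families named above map directly onto standard scikit-learn generators. The sketch below shows a plausible generation setup; the sample count follows the paper's N = 256, but the noise levels, class-separation settings, and seeds are assumptions (the paper's actual parameters are in its Appendix G).

```python
# Sketch of the paper's three synthetic binary-classification task types
# (linear, circle, moon) using scikit-learn's built-in generators.
# Noise/seed values here are illustrative assumptions, not the paper's.
from sklearn.datasets import make_classification, make_circles, make_moons

n_points = 256  # per-task sample count reported in the paper

# Linearly separable task: 2D features, one cluster per class
X_lin, y_lin = make_classification(
    n_samples=n_points, n_features=2, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=0)

# Circular decision boundary: inner vs. outer ring
X_cir, y_cir = make_circles(
    n_samples=n_points, noise=0.05, factor=0.5, random_state=0)

# Interleaved half-moons: non-linear, non-radial boundary
X_moo, y_moo = make_moons(n_samples=n_points, noise=0.1, random_state=0)

for name, (X, y) in {"linear": (X_lin, y_lin),
                     "circle": (X_cir, y_cir),
                     "moon": (X_moo, y_moo)}.items():
    print(name, X.shape, y.shape)
```

Each generator returns 2D points with binary labels, which is what makes the resulting decision boundaries easy to visualize on a grid.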
Dataset Splits | No | The paper describes 'training tasks' and 'testing tasks' and their dataset-generation parameters in Appendix G, but does not explicitly mention a validation split.
Hardware Specification | No | The paper mentions generating decision boundaries with "8-bit quantization due to computational constraints," implying hardware limits, but it does not specify GPU/CPU models, processor types, or memory amounts.
Software Dependencies | No | The paper mentions using scikit-learn [Pedregosa et al., 2011] for dataset generation but does not provide a version number for scikit-learn or any other software dependency.
Experiment Setup | Yes | "We choose a grid size of 50 × 50, resulting in 2500 queries for each decision boundary. ... To do this, we finetune a pretrained Llama model [Touvron et al., 2023] on a set of 1000 binary classification tasks... For each task, we randomly sample N = 256 data points... We then sample the number of context points m ~ U[8, 128], and finetune the LLM to predict y_{i>m} given x_{i>m} and the preceding examples... We finetune the pretrained LLM using LoRA [Hu et al., 2021] and finetune the attention layers. ... In our experiments, we used several classical machine learning models with the following hyperparameters: Decision Tree Classifier: we set the maximum depth of the tree to 3. Multi-Layer Perceptron: the network consists of two hidden layers, each with 256 neurons, and the maximum number of iterations is set to 1000. K-Nearest Neighbors: the number of neighbors is set to 5. Support Vector Machine (SVM): we used a radial basis function (RBF) kernel with a gamma value of 0.2."
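The classical-baseline hyperparameters and the 50 × 50 query grid quoted above translate directly into scikit-learn. The sketch below instantiates the four baselines with exactly the stated settings and probes each fitted model on 2500 grid points, as the paper does for LLM decision boundaries; the moon-task data and its noise/seed are illustrative assumptions.

```python
# Sketch: classical baselines with the paper's stated hyperparameters,
# queried on a 50x50 grid (2500 points) to trace each decision boundary.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

baselines = {
    "decision_tree": DecisionTreeClassifier(max_depth=3),
    "mlp": MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf", gamma=0.2),
}

# Illustrative in-context data: one moon task (noise/seed are assumptions)
X, y = make_moons(n_samples=128, noise=0.1, random_state=0)

# 50 x 50 grid spanning the data range -> 2500 queries per boundary
xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 50)
ys = np.linspace(X[:, 1].min(), X[:, 1].max(), 50)
grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

for name, clf in baselines.items():
    clf.fit(X, y)
    preds = clf.predict(grid)  # one label per grid query point
    print(name, preds.shape)
```

Reshaping `preds` back to (50, 50) gives the boundary image the paper visualizes; the same grid is what turns an LLM's per-query predictions into a comparable picture.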