Position: Do pretrained Transformers Learn In-Context by Gradient Descent?

Authors: Lingfeng Shen, Aayush Mishra, Daniel Khashabi

Venue: ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMA-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations." |
| Researcher Affiliation | Academia | "Johns Hopkins University, Baltimore MD." |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions GPT-Neo and GPT-J with URLs, but these refer to models/frameworks used by the authors, not the source code for the methodology described in this paper. |
| Open Datasets | Yes | "For benchmarking, we select the following datasets: AGNews (Zhang et al., 2015), CB (De Marneffe et al., 2019), SST-2 (Socher et al., 2013), and RTE (Dagan et al., 2005)." (A loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions evaluating models on a test set `Sf test` that is disjoint from the demonstration set `Sf`, but it does not specify explicit training/validation/test splits (e.g., percentages or counts for a validation set). (A disjoint-split sketch follows the table.) |
| Hardware Specification | No | "GPU machines for conducting experiments were provided by ARCH Rockfish cluster at Johns Hopkins University (https://www.arch.jhu.edu)." This is a general statement about the cluster used, but it lacks specific details on GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions models such as LLaMA (7B) and GPT-J, but it does not provide version numbers for software dependencies or libraries (e.g., PyTorch or Python versions). |
| Experiment Setup | Yes | "We evaluate ICL with varying demonstration sizes N ∈ {1, 2, 4, 8} and for GD, we finetune the models with the same corresponding ICL demonstrations, experimenting with a variety of learning rates {1e-4, 5e-4, 1e-5, 5e-5} over 200 epochs, which ensures the convergence of the model." (A sketch of this ICL-vs-GD sweep follows the table.) |
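The Open Datasets row lists AGNews, CB, SST-2, and RTE. Below is a minimal loading sketch assuming the Hugging Face `datasets` library; the hub IDs are our assumptions, since the paper does not state its data-loading tooling.

```python
# Minimal sketch of loading the four benchmark datasets named in the paper.
# The Hugging Face hub IDs below are assumptions, not taken from the paper.
from datasets import load_dataset

benchmarks = {
    "agnews": load_dataset("ag_news"),        # AGNews (Zhang et al., 2015)
    "cb": load_dataset("super_glue", "cb"),   # CommitmentBank (De Marneffe et al., 2019)
    "sst2": load_dataset("glue", "sst2"),     # SST-2 (Socher et al., 2013)
    "rte": load_dataset("glue", "rte"),       # RTE (Dagan et al., 2005)
}

# Print the available splits and their sizes for each benchmark.
for name, dataset in benchmarks.items():
    print(name, {split: len(dataset[split]) for split in dataset})
```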
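The Dataset Splits row only states that the test set is disjoint from the demonstrations. The sketch below shows one way to construct such a disjoint pair; the function name, pool sizes, and sampling scheme are illustrative and not taken from the paper.

```python
import random

def disjoint_demos_and_test(examples, n_demos, n_test, seed=0):
    """Sample N demonstrations and a disjoint test set from one labeled pool."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    demos = pool[:n_demos]                 # used as the ICL prompt and as GD finetuning data
    test = pool[n_demos:n_demos + n_test]  # disjoint from the demonstrations by construction
    return demos, test

# Example: 8 demonstrations and 100 held-out test examples from a toy labeled pool.
pool = [(f"example text {i}", i % 2) for i in range(500)]
demos, test = disjoint_demos_and_test(pool, n_demos=8, n_test=100)
```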
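The Experiment Setup row describes two matched conditions: ICL with N ∈ {1, 2, 4, 8} demonstrations, and GD finetuning on the same demonstrations over the learning-rate grid {1e-4, 5e-4, 1e-5, 5e-5} for 200 epochs. The sketch below mirrors that setup with PyTorch and Hugging Face Transformers; the model name, prompt template, and optimizer choice are assumptions, not the authors' code.

```python
# Illustrative sketch of the two matched conditions described in the paper.
# Model name, prompt template, and optimizer are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"   # small stand-in; the paper uses LLaMA-7B, GPT-J, etc.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token by default

def icl_prompt(demos, query):
    """ICL condition: concatenate N labeled demonstrations followed by the unlabeled query."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    return "\n\n".join(lines + [f"Input: {query}\nLabel:"])

def finetune_on_demos(demos, lr, epochs=200):
    """GD condition: finetune the model on the same demonstrations used for ICL."""
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    texts = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100   # ignore padding positions in the loss
    model.train()
    for _ in range(epochs):
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model

demos = [("the movie was wonderful", "positive"), ("a tedious, joyless mess", "negative")]
print(icl_prompt(demos, query="a quietly moving film"))
for lr in [1e-4, 5e-4, 1e-5, 5e-5]:       # learning-rate grid reported in the paper
    finetuned = finetune_on_demos(demos, lr)
```

Using a small stand-in model keeps the sketch runnable on modest hardware; the paper's actual comparisons are run on larger pretrained models such as LLaMA-7B and GPT-J.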