Position: Do pretrained Transformers Learn In-Context by Gradient Descent?

Authors: Lingfeng Shen, Aayush Mishra, Daniel Khashabi

Venue: ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMA-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations." |
| Researcher Affiliation | Academia | "Johns Hopkins University, Baltimore MD." |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions GPT-Neo and GPT-J with URLs, but these refer to models/frameworks used by the authors, not the source code for the methodology described in this paper. |
| Open Datasets | Yes | "For benchmarking, we select the following datasets: AGNews (Zhang et al., 2015), CB (De Marneffe et al., 2019), SST-2 (Socher et al., 2013), and RTE (Dagan et al., 2005)." (A loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions evaluating models on a test set `Sf test` that is disjoint from the demonstration set `Sf`, but it does not specify explicit training/validation/test splits (e.g., percentages or counts for a validation set). (A disjoint-split sketch follows the table.) |
| Hardware Specification | No | "GPU machines for conducting experiments were provided by ARCH Rockfish cluster at Johns Hopkins University (https://www.arch.jhu.edu)." This is a general statement about the cluster used, but it lacks specific details on GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions models such as LLaMA (7B) and GPT-J, but it does not provide version numbers for software dependencies or libraries (e.g., PyTorch or Python versions). |
| Experiment Setup | Yes | "We evaluate ICL with varying demonstration sizes N ∈ {1, 2, 4, 8} and for GD, we finetune the models with the same corresponding ICL demonstrations, experimenting with a variety of learning rates {1e-4, 5e-4, 1e-5, 5e-5} over 200 epochs, which ensures the convergence of the model." (A sketch of this ICL-vs-GD sweep follows the table.) |
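The Open Datasets row lists AGNews, CB, SST-2, and RTE. Below is a minimal loading sketch assuming the Hugging Face `datasets` library; the hub IDs are our assumptions, since the paper does not state its data-loading tooling.

```python
# Minimal sketch of loading the four benchmark datasets named in the paper.
# The Hugging Face hub IDs below are assumptions, not taken from the paper.
from datasets import load_dataset

benchmarks = {
    "agnews": load_dataset("ag_news"),        # AGNews (Zhang et al., 2015)
    "cb": load_dataset("super_glue", "cb"),   # CommitmentBank (De Marneffe et al., 2019)
    "sst2": load_dataset("glue", "sst2"),     # SST-2 (Socher et al., 2013)
    "rte": load_dataset("glue", "rte"),       # RTE (Dagan et al., 2005)
}

# Print the available splits and their sizes for each benchmark.
for name, dataset in benchmarks.items():
    print(name, {split: len(dataset[split]) for split in dataset})
```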
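The Dataset Splits row only states that the test set is disjoint from the demonstrations. The sketch below shows one way to construct such a disjoint pair; the function name, pool sizes, and sampling scheme are illustrative and not taken from the paper.

```python
import random

def disjoint_demos_and_test(examples, n_demos, n_test, seed=0):
    """Sample N demonstrations and a disjoint test set from one labeled pool."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    demos = pool[:n_demos]                 # used as the ICL prompt and as GD finetuning data
    test = pool[n_demos:n_demos + n_test]  # disjoint from the demonstrations by construction
    return demos, test

# Example: 8 demonstrations and 100 held-out test examples from a toy labeled pool.
pool = [(f"example text {i}", i % 2) for i in range(500)]
demos, test = disjoint_demos_and_test(pool, n_demos=8, n_test=100)
```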
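The Experiment Setup row describes two matched conditions: ICL with N ∈ {1, 2, 4, 8} demonstrations, and GD finetuning on the same demonstrations over the learning-rate grid {1e-4, 5e-4, 1e-5, 5e-5} for 200 epochs. The sketch below mirrors that setup with PyTorch and Hugging Face Transformers; the model name, prompt template, and optimizer choice are assumptions, not the authors' code.

```python
# Illustrative sketch of the two matched conditions described in the paper.
# Model name, prompt template, and optimizer are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-125m"   # small stand-in; the paper uses LLaMA-7B, GPT-J, etc.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token by default

def icl_prompt(demos, query):
    """ICL condition: concatenate N labeled demonstrations followed by the unlabeled query."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    return "\n\n".join(lines + [f"Input: {query}\nLabel:"])

def finetune_on_demos(demos, lr, epochs=200):
    """GD condition: finetune the model on the same demonstrations used for ICL."""
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    texts = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100   # ignore padding positions in the loss
    model.train()
    for _ in range(epochs):
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model

demos = [("the movie was wonderful", "positive"), ("a tedious, joyless mess", "negative")]
print(icl_prompt(demos, query="a quietly moving film"))
for lr in [1e-4, 5e-4, 1e-5, 5e-5]:       # learning-rate grid reported in the paper
    finetuned = finetune_on_demos(demos, lr)
```

Using a small stand-in model keeps the sketch runnable on modest hardware; the paper's actual comparisons are run on larger pretrained models such as LLaMA-7B and GPT-J.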