Position: Do pretrained Transformers Learn In-Context by Gradient Descent?
Authors: Lingfeng Shen, Aayush Mishra, Daniel Khashabi
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive empirical analyses on language models pre-trained on natural data (LLaMA-7B). Our comparisons of three performance metrics highlight the inconsistent behavior of ICL and GD as a function of various factors such as datasets, models, and the number of demonstrations. |
| Researcher Affiliation | Academia | 1Johns Hopkins University, Baltimore MD. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions GPT-Neo and GPT-J with URLs, but these refer to models/frameworks used by the authors, not the source code for the methodology described in this paper. |
| Open Datasets | Yes | For benchmarking, we select the following datasets: AGNews (Zhang et al., 2015), CB (De Marneffe et al., 2019), SST-2 (Socher et al., 2013), and RTE (Dagan et al., 2005). (A hedged loading sketch follows this table.) |
| Dataset Splits | No | The paper mentions evaluating models using a test set `S_f^test` that is disjoint from the demonstration set `S_f`, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or counts for a validation set). |
| Hardware Specification | No | GPU machines for conducting experiments were provided by ARCH Rockfish cluster at Johns Hopkins University (https://www.arch.jhu.edu). This is a general statement about the cluster used, but lacks specific details on GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper mentions models such as LLaMA (7B) and GPT-J, but does not provide specific version numbers for software dependencies or libraries (e.g., PyTorch version, Python version). |
| Experiment Setup | Yes | We evaluate ICL with varying demonstration sizes N ∈ {1, 2, 4, 8} and for GD, we finetune the models with the same corresponding ICL demonstrations, experimenting with a variety of learning rates {1e-4, 5e-4, 1e-5, 5e-5} over 200 epochs, which ensures the convergence of the model. (A minimal sketch of this setup follows this table.) |
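The Open Datasets row names four public benchmarks. Below is a minimal loading sketch using the Hugging Face `datasets` library; the paper does not say how the data were obtained, so the library choice and the dataset identifiers (`ag_news`, `super_glue/cb`, `glue/sst2`, `glue/rte`) are assumptions, not the authors' setup.

```python
# Hedged sketch: load the four benchmarks named in the "Open Datasets" row.
# Dataset identifiers are assumptions; the paper does not specify a loading mechanism.
from datasets import load_dataset

benchmarks = {
    "AGNews": load_dataset("ag_news"),        # Zhang et al., 2015
    "CB": load_dataset("super_glue", "cb"),   # De Marneffe et al., 2019
    "SST-2": load_dataset("glue", "sst2"),    # Socher et al., 2013
    "RTE": load_dataset("glue", "rte"),       # Dagan et al., 2005
}

# Print the splits and sizes that ship with each benchmark.
for name, ds in benchmarks.items():
    print(name, {split: len(ds[split]) for split in ds})
```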
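The Experiment Setup row describes the core comparison: ICL with N demonstrations versus gradient-descent (GD) finetuning on those same demonstrations, swept over a small learning-rate grid for 200 epochs. The sketch below illustrates that setup under stated assumptions: the model name, prompt template, and training-loop details are hypothetical stand-ins, since the table quotes only the demonstration sizes, the learning-rate grid, and the epoch count.

```python
# Illustrative sketch of the ICL-vs-GD comparison (assumptions noted above).
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-1.3B"    # stand-in; the paper also evaluates GPT-J and LLaMA-7B
DEMO_SIZES = [1, 2, 4, 8]                 # N, the number of ICL demonstrations
LEARNING_RATES = [1e-4, 5e-4, 1e-5, 5e-5] # grid quoted in the Experiment Setup row
EPOCHS = 200                              # quoted as sufficient for convergence

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def icl_prompt(demos, query):
    """Build an ICL prompt from N (input, label) demonstrations (hypothetical template)."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)
    return f"{shots}\nInput: {query}\nLabel:"

def finetune_on_demos(model, demos, lr):
    """GD baseline: finetune the model on the same demonstrations used for ICL."""
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(EPOCHS):
        for x, y in demos:
            batch = tokenizer(f"Input: {x}\nLabel: {y}", return_tensors="pt")
            loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```

In the paper, the ICL model and the finetuned model are then compared on a held-out test set disjoint from the demonstrations; the specific comparison metrics are not detailed in this table.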