Implicit meta-learning may lead language models to trust more reliable sources
Authors: Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Kacper Mlodozeniec, Tegan Maharaj, David Krueger
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a thorough empirical investigation of this phenomenon, finding (among other things) that (i) it occurs in both pretrained LLMs and those trained from scratch, as well as on a vision task, and (ii) larger models and smaller batch sizes tend to give more IML. Our code & data are available at github.com/krasheninnikov/internalization. |
| Researcher Affiliation | Academia | ¹University of Cambridge, ²Max Planck Institute for Intelligent Systems, ³University of Toronto. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code & data are available at github.com/krasheninnikov/internalization. |
| Open Datasets | Yes | We fine-tune the 2.8B parameter Pythia model (Biderman et al., 2023), a decoder-only transformer pre-trained on the Pile dataset (Gao et al., 2020), on a dataset of definitions and QA pairs, with the causal language modelling objective (i.e. autoregressive). We use the Cross-Verified database (CVDB) (Laouenan et al., 2022) of famous people. Concretely, we construct an MNIST-based dataset (Deng, 2012) with an analogous notion of QA and definition examples, illustrated in Figure 5. |
| Dataset Splits | Yes | 70% of the entities are randomly assigned to X1, and the remainder are assigned to X2. Then, these entity groups are randomly split into the various subsets of X1 and X2. Of the 6 questions for each entity in CVDB, 5 go to the training set for subsets where QA pairs are included in the training set (all subsets in X1), while the remaining question (independently sampled for each entity) is assigned to the corresponding validation subset. (A minimal split sketch appears below the table.) |
| Hardware Specification | Yes | We estimate our total compute usage for this project at around 20k hours with NVIDIA A100-80GB GPUs. Training ConvNeXt V2 Tiny for the MNIST experiment takes about 2 hours on an NVIDIA 4090Ti. |
| Software Dependencies | No | The paper mentions software like 'Hugging Face Transformers library', 'Adafactor optimizer', and 'ConvNeXt V2 model', but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | We use the Adafactor optimizer (Shazeer & Stern, 2018) with a batch size of 256 datapoints. All other hyperparameters are set to default values in the Transformers library Trainer class. To avoid in-context learning we do not use chunking, and instead pad our datapoints to max_context_length = 64. We train the model with AdamW for 120000 training steps with a batch size of 128, learning rate 3×10⁻⁴, 2000 steps of linear learning rate warm-up, and other optimization hyperparameters matching the original paper. (A hedged training-configuration sketch appears below the table.) |
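The Dataset Splits row describes a two-stage procedure: entities are divided 70/30 into X1 and X2, and for X1 entities 5 of the 6 questions go to training with the held-out question used for validation. The following is a minimal, illustrative sketch of that split, assuming a list of entities and a dict mapping each entity to its six QA pairs; the names (`split_entities_and_qa`, `qa_pairs_by_entity`) are assumptions, not the authors' code, and the further subdivision of X1/X2 into their subsets is omitted.

```python
import random

def split_entities_and_qa(entities, qa_pairs_by_entity, seed=0):
    """Illustrative split: 70% of entities go to X1, the rest to X2.
    For X1 entities (whose QA pairs appear in training), 5 of the 6
    questions go to the training set and the held-out one to validation."""
    rng = random.Random(seed)
    shuffled = list(entities)
    rng.shuffle(shuffled)
    n_x1 = int(0.7 * len(shuffled))
    x1_entities, x2_entities = shuffled[:n_x1], shuffled[n_x1:]

    train_qa, val_qa = [], []
    for ent in x1_entities:
        qas = list(qa_pairs_by_entity[ent])
        rng.shuffle(qas)                 # independently sample the held-out question
        train_qa.extend(qas[:5])         # 5 questions per entity for training
        val_qa.extend(qas[5:])           # remaining question for validation
    # Assignment of X1/X2 entities to their finer-grained subsets is not shown.
    return x1_entities, x2_entities, train_qa, val_qa
```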
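The Experiment Setup row amounts to a Hugging Face Trainer configuration: Adafactor, batch size 256, fixed-length padding to 64 tokens, and otherwise default hyperparameters. Below is a hedged sketch of that configuration; the argument names follow the `transformers` API, but the checkpoint choice, dataset wiring, and tokenization function are assumptions rather than the authors' actual script.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

# Assumed checkpoint for the 2.8B Pythia model mentioned in the paper.
model_name = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia's tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    # Pad each datapoint to a fixed context length instead of chunking,
    # matching "pad our datapoints to max_context_length = 64".
    return tokenizer(batch["text"], padding="max_length",
                     truncation=True, max_length=64)

args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=256,  # "batch size of 256 datapoints"
    optim="adafactor",                # Adafactor optimizer
    # All other hyperparameters are left at the Trainer defaults, as in the paper.
)

# `train_dataset` (tokenized definitions + QA pairs) is assumed to exist:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```

The separate AdamW run quoted in the same row (120000 steps, batch size 128, learning rate 3×10⁻⁴, 2000 warm-up steps) refers to the from-scratch/vision experiments and is not reproduced in this sketch.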