An Explanation of In-context Learning as Implicit Bayesian Inference
Authors: Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning. Section 4 (Simulations): We generate the GINC dataset and show that Transformers (Vaswani et al., 2017) and LSTMs (Hochreiter & Schmidhuber, 1997) trained on GINC exhibit in-context learning. |
| Researcher Affiliation | Academia | Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma Stanford University {xie,aditir,pliang,tengyuma}@cs.stanford.edu |
| Pseudocode | No | The paper does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | The code, data, and experiments are located on GitHub and CodaLab. |
| Open Datasets | Yes | The code, data, and experiments are located on GitHub and CodaLab. GINC dataset. We construct the GINC dataset according to our theory (see Appendix F.1). |
| Dataset Splits | Yes | The dataset contains 1000 training documents and 100 validation documents, where training documents have 10240 tokens and validation documents have 1024 tokens. |
| Hardware Specification | Yes | The hardware was mainly Titan Xp GPUs; models were trained and evaluated using 16-bit precision. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face library' and the 'AdamW optimizer' but does not provide specific version numbers for these software components or any other libraries like PyTorch or TensorFlow. |
| Experiment Setup | Yes | Our Transformer models are based on the GPT-2 architectures with 4, 12, and 16 layers respectively, with 12 attention heads, 768 dimensional embeddings, residual/embedding/attention dropout set to 0.1, and a context window of 1024. We train for 5 epochs using the AdamW optimizer... with a batch size of 8 and a linear learning rate schedule (with 1000 step warmup) up to a learning rate of 8e-4... |
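
As a reading aid for the setup row above, the following is a minimal sketch of how the reported 12-layer configuration might be assembled with the Hugging Face library and AdamW optimizer that the paper mentions. `GPT2Config` and `TrainingArguments` are standard Hugging Face APIs, but the output directory and the GINC vocabulary size used here are placeholder assumptions, not values reported in this section; this is an illustration, not the authors' released code.

```python
# Illustrative sketch only: reconstructs the reported hyperparameters with the
# Hugging Face library; paths and the vocabulary size are placeholder assumptions.
from transformers import GPT2Config, GPT2LMHeadModel, TrainingArguments

GINC_VOCAB_SIZE = 50  # placeholder; the GINC vocabulary size is not given in this section

# GPT-2-style architecture from the paper: 12 layers (the 4- and 16-layer variants
# differ only in n_layer), 12 heads, 768-dim embeddings, dropout 0.1, context 1024.
config = GPT2Config(
    vocab_size=GINC_VOCAB_SIZE,
    n_layer=12,
    n_head=12,
    n_embd=768,
    n_positions=1024,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)
model = GPT2LMHeadModel(config)

# Optimization settings from the paper: 5 epochs, AdamW (the Trainer default),
# batch size 8, linear schedule with 1000 warmup steps up to lr 8e-4,
# and 16-bit precision as noted in the hardware row.
training_args = TrainingArguments(
    output_dir="ginc-gpt2-12layer",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=8e-4,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    fp16=True,
)
```

Training on the GINC splits described above (1000 training documents of 10240 tokens, 100 validation documents of 1024 tokens) would then proceed through the standard `Trainer` API; the paper's own pipeline may differ in details not reported in this table.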