Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens

Authors: Ruifeng Ren, Yong Liu

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, experiments are designed to support our findings.
Researcher Affiliation | Academia | Ruifeng Ren, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China (renruifeng920@ruc.edu.cn); Yong Liu, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China (liuyonggsai@ruc.edu.cn)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Answer: [Yes] Justification: We have provided our code and instructions in the supplemental material.
Open Datasets | Yes | We choose the BERT-base-uncased model (can be downloaded from the Huggingface library [Wolf, 2019], hereafter referred to as BERT [Kenton and Toutanova, 2019]) to validate the effectiveness of modifications to the attention mechanism and select four relatively smaller GLUE datasets (CoLA, MRPC, STS-B, RTE) [Wang, 2018].
Dataset Splits | No | The paper describes the input structure for ICL inference (demonstration tokens and query tokens) and how some tokens are used as query tokens for prediction, but it does not give explicit train/validation/test split information (e.g., percentages, sample counts, or references to predefined validation splits), particularly for the synthetic tasks where data is generated on the fly. For the GLUE datasets, it reports batch size, learning rate, and epochs but not how the datasets were split for validation.
Hardware Specification | Yes | The experiments are completed on a single 24GB NVIDIA GeForce RTX 3090 and the experiments can be completed within one day. ... All experiments are conducted on a single 24GB NVIDIA GeForce RTX 3090.
Software Dependencies | Yes | We choose the BERT-base-uncased model (can be downloaded from the Huggingface library [Wolf, 2019], hereafter referred to as BERT [Kenton and Toutanova, 2019])
Experiment Setup | Yes | We set the dimension of the random features as d_r = 100(d_t + d_s) = 1200 to obtain a relatively accurate estimation. ... We choose stochastic gradient descent (SGD) [Amari, 1993] as the optimizer and set the learning rate to 0.003 for the normal and regularized models and to 0.005 for the remaining experiments. ... we set the batch size to 32, the learning rate to 2e-5, and the number of epochs to 5 for all datasets.
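For context, a minimal sketch of how the reported GLUE fine-tuning setup (BERT-base-uncased from the Hugging Face hub, batch size 32, learning rate 2e-5, 5 epochs) could be reproduced with the `transformers` and `datasets` libraries is shown below. This is an assumption-laden illustration, not the authors' released code: the paper's attention-mechanism modifications are omitted, and the task choice (CoLA), `max_length=128`, and the output directory are illustrative.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load one of the four reported GLUE tasks (CoLA chosen here for illustration).
dataset = load_dataset("glue", "cola")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    # CoLA has a single "sentence" field; max_length=128 is an assumption.
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

# Hyperparameters quoted in the Experiment Setup row above.
args = TrainingArguments(
    output_dir="bert-cola",            # illustrative output path
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,               # enables dynamic padding of batches
)
trainer.train()
```

The same hyperparameters are stated to apply to all four GLUE datasets, so only the dataset name and number of labels would need to change per task.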
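Likewise, the optimizer reported for the synthetic in-context learning experiments is plain SGD with a learning rate of 0.003 (0.005 for the remaining runs). The sketch below shows that setup in PyTorch with a placeholder model; the paper's actual Transformer architecture lives in the authors' supplemental code, and the input dimension of 12 is only an assumption consistent with d_t + d_s = 12 implied by d_r = 100(d_t + d_s) = 1200.

```python
import torch

# Placeholder model standing in for the paper's ICL Transformer (hypothetical).
model = torch.nn.Linear(12, 1)

# SGD with the learning rate reported for the normal and regularized models.
optimizer = torch.optim.SGD(model.parameters(), lr=0.003)
loss_fn = torch.nn.MSELoss()

# One illustrative gradient step on randomly generated data.
x = torch.randn(32, 12)
y = torch.randn(32, 1)
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```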