Transformer with Memory Replay

Authors: Rui Liu, Barzan Mozafari

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on GLUE and SQuAD benchmark datasets show that Transformer with Memory Replay achieves at least 1% point increase compared to the baseline transformer model when pretrained with the same number of examples.
Researcher Affiliation | Academia | Rui Liu, Barzan Mozafari, Computer Science and Engineering, University of Michigan, Ann Arbor, {ruixliu, mozafari}@umich.edu
Pseudocode | No | The paper refers to an algorithm from another paper but does not provide pseudocode or an algorithm block within its own content.
Open Source Code | No | The paper does not provide an explicit statement or link to its open-source code.
Open Datasets | Yes | We pre-train our model with two different sizes: a small model and a base model on English Wikipedia. We use two commonly used datasets as the benchmark to evaluate performance: General Language Understanding Evaluation (GLUE) (Wang et al. 2018) and Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016).
Dataset Splits | Yes | We use two commonly used datasets as the benchmark to evaluate performance: General Language Understanding Evaluation (GLUE) (Wang et al. 2018) and Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016). Unless stated otherwise, results are on the dev set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using 'Adam with warmup' for pre-training but does not provide specific software dependencies with version numbers (e.g., PyTorch, TensorFlow, or other library versions).
Experiment Setup | Yes | We use Adam with warmup to pre-train the models. The detailed setup is the same as Clark et al. (2020) if not stated otherwise. Specifically, we set ϵ = 1e-6, β1 = 0.9 and β2 = 0.999. The mini-batch size is 128 for the small model and 256 for the base model. The memory buffer size N is set to 1k in our experiments.
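The Open Datasets and Dataset Splits rows quote the GLUE and SQuAD benchmarks with evaluation on the dev set. A minimal sketch of obtaining those dev splits is shown below, assuming the Hugging Face `datasets` library, which the paper does not mention; the choice of MRPC as the example GLUE task is also an assumption for illustration.

```python
# A minimal sketch, assuming the Hugging Face `datasets` library (the paper
# does not state which data-loading tooling was used). GLUE is a suite of
# tasks; MRPC is shown only as one example task name.
from datasets import load_dataset

glue_mrpc = load_dataset("glue", "mrpc")  # GLUE benchmark (Wang et al. 2018)
squad = load_dataset("squad")             # SQuAD v1.1 (Rajpurkar et al. 2016)

# "Results are on the dev set": the dev set corresponds to the validation split.
glue_dev = glue_mrpc["validation"]
squad_dev = squad["validation"]

print(f"GLUE/MRPC dev examples: {len(glue_dev)}")
print(f"SQuAD dev examples: {len(squad_dev)}")
```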
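The Experiment Setup row reports Adam with warmup, ϵ = 1e-6, β1 = 0.9, β2 = 0.999, mini-batch sizes of 128 (small) and 256 (base), and a memory buffer of size 1k. The sketch below shows one way to express that configuration, assuming PyTorch; the framework, learning rate, warmup/total step counts, and the placeholder model are not specified in the paper and are assumptions here.

```python
# A minimal sketch of the reported optimizer configuration, assuming PyTorch.
# The learning rate, schedule lengths, and model are placeholders, not values
# from the paper.
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)  # placeholder for the transformer model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,             # assumed; not specified in the quoted setup
    betas=(0.9, 0.999),  # β1, β2 from the paper
    eps=1e-6,            # ϵ from the paper
)

warmup_steps, total_steps = 10_000, 1_000_000  # assumed schedule lengths

def warmup_then_linear_decay(step: int) -> float:
    """Linear warmup followed by linear decay, a common 'Adam with warmup' schedule."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_linear_decay)

batch_size_small, batch_size_base = 128, 256  # mini-batch sizes from the paper
memory_buffer_size = 1_000                    # memory buffer size N = 1k
```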