Transformer with Memory Replay
Authors: Rui Liu, Barzan Mozafari (pp. 7567-7575)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on GLUE and SQuAD benchmark datasets show that Transformer with Memory Replay achieves at least 1% point increase compared to the baseline transformer model when pretrained with the same number of examples. |
| Researcher Affiliation | Academia | Rui Liu, Barzan Mozafari Computer Science and Engineering, University of Michigan, Ann Arbor {ruixliu, mozafari}@umich.edu |
| Pseudocode | No | The paper refers to an algorithm from another paper but does not provide pseudocode or an algorithm block within its own content. |
| Open Source Code | No | The paper does not provide an explicit statement or link to its open-source code. |
| Open Datasets | Yes | We pre-train our model with two different sizes: a small model and a base model on English Wikipedia. We use two commonly used datasets as the benchmark to evaluate performance: General Language Understanding Evaluation (GLUE) (Wang et al. 2018) and Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016). |
| Dataset Splits | Yes | We use two commonly used datasets as the benchmark to evaluate performance: General Language Understanding Evaluation (GLUE) (Wang et al. 2018) and Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016). Unless stated otherwise, results are on the dev set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Adam with warmup' for pre-training but does not provide specific software dependencies with version numbers (e.g., PyTorch version, TensorFlow version, or other library versions). |
| Experiment Setup | Yes | We use Adam with warmup to pre-train the models. The detailed setup is the same as Clark et al. (2020) if not stated otherwise. Specifically, we set ϵ = 1e-6, β1 = 0.9 and β2 = 0.999. The mini-batch size is 128 for the small model and 256 for the base model. The memory buffer size N is set to 1k in our experiments. |
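
The two evaluation benchmarks named in the table (GLUE and SQuAD) are publicly available. The paper does not state which tooling was used to obtain them, so the sketch below, using the Hugging Face `datasets` library, is only one possible way to fetch the same data; the task choice ("mnli") and split names are illustrative.

```python
# Minimal sketch of loading the benchmarks cited in the table, assuming the
# Hugging Face `datasets` library (not specified by the paper).
from datasets import load_dataset

# GLUE is a suite of tasks; "mnli" is shown as one representative task.
glue_mnli = load_dataset("glue", "mnli")

# SQuAD v1.1 (Rajpurkar et al. 2016), as cited in the table.
squad = load_dataset("squad")

# The report notes that results are on the dev/validation splits.
print(glue_mnli["validation_matched"].num_rows)
print(squad["validation"].num_rows)
```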
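
The experiment-setup row reports Adam with warmup, ϵ = 1e-6, β1 = 0.9, β2 = 0.999, mini-batch sizes of 128 (small) and 256 (base), and a memory buffer size N = 1k. Since the paper releases no code, the following is only a minimal sketch of that optimizer configuration, assuming PyTorch; the learning rate, warmup length, total steps, and the linear warmup/decay shape are assumptions (the paper defers such details to Clark et al. 2020), while the eps/betas, batch sizes, and buffer size come from the table.

```python
# Sketch of the reported pre-training optimizer setup (PyTorch assumed).
import torch
from torch.optim.lr_scheduler import LambdaLR

BATCH_SIZE_SMALL = 128       # mini-batch size for the small model (reported)
BATCH_SIZE_BASE = 256        # mini-batch size for the base model (reported)
MEMORY_BUFFER_SIZE = 1_000   # memory buffer size N (reported)


def build_optimizer(model: torch.nn.Module,
                    lr: float = 5e-4,            # assumed; not stated in the excerpt
                    warmup_steps: int = 10_000,  # assumed; not stated in the excerpt
                    total_steps: int = 1_000_000):  # assumed
    """Adam with the reported eps/betas and a warmup schedule."""
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.999),  # β1, β2 as reported
        eps=1e-6,            # ϵ as reported
    )

    def lr_lambda(step: int) -> float:
        # Linear warmup followed by linear decay; the exact schedule in the
        # paper follows Clark et al. (2020) and may differ.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```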