Cached Transformers: Improving Transformers with Differentiable Memory Cache
Authors: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and demonstrates applicability to a broader range of settings. (An illustrative sketch of the GRC mechanism is given after the table.) |
| Researcher Affiliation | Collaboration | Zhaoyang Zhang (1), Wenqi Shao (1), Yixiao Ge (2), Xiaogang Wang (1), Jinwei Gu (1), Ping Luo (3); (1) The Chinese University of Hong Kong, (2) Tencent Inc., (3) The University of Hong Kong. Emails: {zhaoyangzhang@link., wqshao@link., xgwang@ee., jwgu@}cuhk.edu.hk, geyixiao831@gmail.com, pluo@cs.hku.hk |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using "publicly available fairseq framework" but does not state that their own implementation code is released. |
| Open Datasets | Yes | This section extensively evaluates the effectiveness of the proposed Cached Transformer and Gated Recurrent Cache (GRC) in both vision and language tasks, including language modeling on WikiText-103, Long ListOps of Long Range Arena (Tay et al. 2021a), machine translation on IWSLT14 (Cettolo et al. 2014) / IWSLT15 (Cettolo et al. 2015), image classification on ImageNet (Krizhevsky, Sutskever, and Hinton 2012), and object detection and instance segmentation on COCO2017 (Lin et al. 2014). |
| Dataset Splits | Yes | The models are trained on the COCO train2017 (118k images) and evaluated on val2017 (5k images). |
| Hardware Specification | Yes | All of the experiments are conducted on Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using the "fairseq framework" but does not provide specific version numbers for fairseq or other key software components, which is required for reproducibility. |
| Experiment Setup | Yes | For the image classification task, we set the cache ratio r to 0.5 and keep the cache length Tm equal to the length of image patches T. For fair comparisons, we directly replace the self-attention layers in the corresponding transformers with our GRC-Attention module without varying the architecture or hyperparameters (see the drop-in replacement sketch after the table). To maintain spatial token structures, we add positional encodings to our proposed GRC-Attention like other vision transformers. Both the baselines and their cached counterparts are trained with 224x224 inputs using 16 GPUs. To fully validate the proposed cache mechanism, we evaluate GRC-Attention on four recent vision transformers: ViTs (Dosovitskiy et al. 2021), PVT (Wang et al. 2021), Swin Transformer (Liu et al. 2021), and PVT-v2 (Wang et al. 2022). Without bells and whistles, all of the training settings for cached models are kept consistent with the original baselines, including data augmentation, optimizer type, learning rates, and training epochs. |
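The Research Type row summarizes the GRC mechanism: attention over both current and cached tokens, with a recurrent gating unit that continuously updates a differentiable token cache. The PyTorch sketch below is a minimal illustration of that idea, not the authors' implementation: the class and attribute names (`GRCAttention`, `update_gate`, `to_cache`, `mix`) are hypothetical, the cache is kept at full channel width with length Tm = T, and details such as the cache compression ratio r and positional encodings are omitted.

```python
# Illustrative sketch of gated-recurrent-cache attention (hypothetical names,
# simplified relative to the paper's GRC-Attention).
import torch
import torch.nn as nn


class GRCAttention(nn.Module):
    """Self-attention augmented with a gated recurrent cache of tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cache_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_cache = nn.Linear(dim, dim)      # projects current tokens for the cache update
        self.update_gate = nn.Linear(dim, dim)   # recurrent gating unit (sigmoid gate)
        self.mix = nn.Parameter(torch.zeros(1))  # learnable blend of self- and cache-attention
        self.register_buffer("cache", None, persistent=False)

    def forward(self, x):
        B, T, C = x.shape
        if self.cache is None or self.cache.shape != (B, T, C):
            # Initialize a cache of the same length as the token sequence (Tm = T).
            self.cache = x.new_zeros(B, T, C)

        # Attend to current tokens and to cached (past) tokens.
        o_self, _ = self.self_attn(x, x, x, need_weights=False)
        o_mem, _ = self.cache_attn(x, self.cache, self.cache, need_weights=False)

        # Gated recurrent update of the cache from the current tokens.
        g = torch.sigmoid(self.update_gate(x))
        new_cache = (1.0 - g) * self.cache + g * self.to_cache(x)
        self.cache = new_cache.detach()  # persist across iterations without backprop through time

        # Blend the two attention streams with a learnable gate.
        lam = torch.sigmoid(self.mix)
        return lam * o_mem + (1.0 - lam) * o_self
```

In this simplified form the output is a learnable interpolation between standard self-attention and attention over the cache, and the cache is updated with a GRU-style sigmoid gate each forward pass; the paper's exact update rule and head/channel handling may differ.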
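The Experiment Setup row states that self-attention layers are replaced in place with GRC-Attention while all other architecture choices and hyperparameters stay fixed. A sketch of that drop-in swap might look as follows, assuming a timm-style ViT whose blocks expose an `attn` module with `qkv` and `num_heads` attributes; `cache_vit` is an illustrative helper, not code released with the paper, and the positional encodings mentioned in the setup are not modeled here.

```python
import timm  # assumed ViT implementation; any model exposing per-block `attn` modules works


def cache_vit(model_name="vit_small_patch16_224"):
    """Replace every self-attention layer of a ViT with GRC-Attention (illustrative)."""
    model = timm.create_model(model_name, pretrained=False)
    for block in model.blocks:
        dim = block.attn.qkv.in_features   # embedding dimension of this block
        heads = block.attn.num_heads
        block.attn = GRCAttention(dim, num_heads=heads)  # drop-in swap; rest of the net unchanged
    return model
```

Training would then proceed with the baseline's original data augmentation, optimizer, learning rate, and epoch schedule, consistent with the setup described above.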