Efficient Representation Learning via Adaptive Context Pooling
Authors: Chen Huang, Walter Talbott, Navdeep Jaitly, Joshua M Susskind
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning. |
| Researcher Affiliation | Industry | Apple Inc., Cupertino, United States. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code, nor does it provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | We use both the WMT 2014 English-to-German (EN-DE) dataset with about 4.5 million English-German sentence pairs, and the English-French (EN-FR) dataset with about 36 million English-French sentence pairs. A token is a byte pair or a word piece as in (Vaswani et al., 2017). We use enwik8 and text8 datasets, each with 100M characters and 90M/5M/5M for train/dev/test as in (Mahoney, 2009). We benchmark different transformer models on the widely used ImageNet-1K classification dataset (Deng et al., 2009). |
| Dataset Splits | Yes | We use enwik8 and text8 datasets, each with 100M characters and 90M/5M/5M for train/dev/test as in (Mahoney, 2009). There are 1.28M training and 50k validation images from 1k classes. |
| Hardware Specification | Yes | Speed (steps / s) is measured on a V100 GPU. |
| Software Dependencies | No | We use Adam optimizer with the same learning rate schedule in (Vaswani et al., 2017). We train for 300 epochs with the AdamW optimizer, using a cosine decay learning rate scheduler and linear warm-up (20 epochs). |
| Experiment Setup | Yes | For our method, we insert ContextPool after every attention layer. Following (Li et al., 2019b), we train for 250k iterations for Small and Base models, and for 600k iterations with a smaller batch size for the Big model due to the memory constraint. We use Adam optimizer with the same learning rate schedule in (Vaswani et al., 2017). We train our models in 3 stages with increasing sequence lengths (2048, 4096, 8192) and different batch sizes (32, 32, 16). All models are trained for a total of 530k steps with linear learning rate warmup. We also use dropout rates 0.2 and weight decays 0.01. We train for 300 epochs with the AdamW optimizer, using a cosine decay learning rate scheduler and linear warm-up (20 epochs)... We have batch size 1024, initial learning rate 0.001, weight decay 0.05, and the max norm of gradient clipping 1. |
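
The Dataset Splits row quotes a 90M/5M/5M character split of enwik8 and text8. A minimal sketch of that split is below; the file path and the use of raw bytes are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical illustration of the 90M/5M/5M character split quoted for
# enwik8/text8 (Mahoney, 2009); the file path is a placeholder.
with open("enwik8", "rb") as f:
    data = f.read(100_000_000)            # each dataset has 100M characters

train = data[:90_000_000]                 # 90M characters for training
dev   = data[90_000_000:95_000_000]       # 5M characters for dev
test  = data[95_000_000:]                 # 5M characters for test
```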
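
The Experiment Setup row lists the ImageNet-1K training hyperparameters: 300 epochs, AdamW, cosine decay with a 20-epoch linear warmup, batch size 1024, initial learning rate 0.001, weight decay 0.05, and gradient clipping to max norm 1. The PyTorch sketch below shows one way to wire those quoted values together; the placeholder model, the dummy loader, and per-epoch scheduler stepping are assumptions rather than details from the paper.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, WARMUP_EPOCHS = 300, 20      # quoted: 300 epochs, 20-epoch linear warmup

model = torch.nn.Linear(768, 1000)   # placeholder for the transformer backbone
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

def lr_lambda(epoch):
    # Linear warmup for the first 20 epochs, cosine decay afterwards.
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Dummy stand-in for an ImageNet-1K loader (the paper uses batch size 1024).
loader = [(torch.randn(8, 768), torch.randint(0, 1000, (8,)))]

for epoch in range(EPOCHS):
    for images, labels in loader:
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        # Quoted max norm of gradient clipping: 1.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
```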