Efficient Representation Learning via Adaptive Context Pooling
Authors: Chen Huang, Walter Talbott, Navdeep Jaitly, Joshua M Susskind
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning. |
| Researcher Affiliation | Industry | Apple Inc., Cupertino, United States. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code, nor does it provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | We use both the WMT 2014 English-to-German (EN-DE) dataset with about 4.5 million English-German sentence pairs, and the English-French (EN-FR) dataset with about 36 million English-French sentence pairs. A token is a byte pair or a word piece as in (Vaswani et al., 2017). We use enwik8 and text8 datasets, each with 100M characters and 90M/5M/5M for train/dev/test as in (Mahoney, 2009). We benchmark different transformer models on the widely used ImageNet-1K classification dataset (Deng et al., 2009). |
| Dataset Splits | Yes | We use enwik8 and text8 datasets, each with 100M characters and 90M/5M/5M for train/dev/test as in (Mahoney, 2009). There are 1.28M training and 50k validation images from 1k classes. |
| Hardware Specification | Yes | Speed (steps / s) is measured on a V100 GPU. |
| Software Dependencies | No | We use Adam optimizer with the same learning rate schedule in (Vaswani et al., 2017). We train for 300 epochs with the AdamW optimizer, using a cosine decay learning rate scheduler and linear warm-up (20 epochs). |
| Experiment Setup | Yes | For our method, we insert ContextPool after every attention layer. Following (Li et al., 2019b), we train for 250k iterations for Small and Base models, and for 600k iterations with a smaller batch size for the Big model due to the memory constraint. We use Adam optimizer with the same learning rate schedule in (Vaswani et al., 2017). We train our models in 3 stages with increasing sequence lengths (2048, 4096, 8192) and different batch sizes (32, 32, 16). All models are trained for a total of 530k steps with linear learning rate warmup. We also use dropout rates 0.2 and weight decays 0.01. We train for 300 epochs with the AdamW optimizer, using a cosine decay learning rate scheduler and linear warm-up (20 epochs)... We have batch size 1024, initial learning rate 0.001, weight decay 0.05, and the max norm of gradient clipping 1. |
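
The Dataset Splits row quotes a 90M/5M/5M character split of enwik8 and text8. A minimal sketch of that split is below; the file path and the use of raw bytes are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical illustration of the 90M/5M/5M character split quoted for
# enwik8/text8 (Mahoney, 2009); the file path is a placeholder.
with open("enwik8", "rb") as f:
    data = f.read(100_000_000)            # each dataset has 100M characters

train = data[:90_000_000]                 # 90M characters for training
dev   = data[90_000_000:95_000_000]       # 5M characters for dev
test  = data[95_000_000:]                 # 5M characters for test
```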
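
The Experiment Setup row lists the ImageNet-1K training hyperparameters: 300 epochs, AdamW, cosine decay with a 20-epoch linear warmup, batch size 1024, initial learning rate 0.001, weight decay 0.05, and gradient clipping to max norm 1. The PyTorch sketch below shows one way to wire those quoted values together; the placeholder model, the dummy loader, and per-epoch scheduler stepping are assumptions rather than details from the paper.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, WARMUP_EPOCHS = 300, 20      # quoted: 300 epochs, 20-epoch linear warmup

model = torch.nn.Linear(768, 1000)   # placeholder for the transformer backbone
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

def lr_lambda(epoch):
    # Linear warmup for the first 20 epochs, cosine decay afterwards.
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Dummy stand-in for an ImageNet-1K loader (the paper uses batch size 1024).
loader = [(torch.randn(8, 768), torch.randint(0, 1000, (8,)))]

for epoch in range(EPOCHS):
    for images, labels in loader:
        loss = torch.nn.functional.cross_entropy(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        # Quoted max norm of gradient clipping: 1.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
```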