Catformer: Designing Stable Transformers via Sensitivity Analysis

Authors: Jared Q Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Ré, Chelsea Finn, Percy Liang

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other transformers, including Gated Transformer-XL, the state-of-the-art architecture designed to address stability, by 13%.
Researcher Affiliation | Collaboration | 1 Stanford University, 2 DeepMind, 3 Google Brain Robotics, 4 Columbia University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code (e.g., a specific repository link, an explicit code-release statement, or code in supplementary materials) for the described methodology.
Open Datasets | Yes | Finally, we apply the Catformer to visual-navigation tasks from DeepMind Lab (Beattie et al., 2016), a suite of challenging RL tasks with high-dimensional observations, complex action spaces, and partial observability.
Dataset Splits | No | Table 2 shows final validation losses for several models from Table 1. However, the paper does not specify exact percentages or sample counts for train/validation/test splits, nor does it reference predefined splits with explicit citations for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed machine specifications) used to run its experiments.
Software Dependencies | No | The paper mentions the Adam optimizer but does not specify version numbers for any software components, programming languages, or libraries used in the experiments.
Experiment Setup | Yes | We use N = 2, 4, and 6 layer Transformer architectures. All models use pre-Layer Norm unless otherwise specified. For all baselines with constant dimension size in their layers... we use the standard Transformer architecture with d = 512 and inner dimension 4d = 2048... The Catformer model is parameter-controlled with the technique described in Appendix D, with e_a = 2 and e_f = 4. Models are trained with the Adam optimizer. Tables 3 and 4 provide the specific learning rates (0.1, 0.2, 0.4) and noise standard deviations (0.0005, 0.001, 0.002), respectively.
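For readers who want to restate the reported setup programmatically, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the class name `ExperimentConfig`, its field names, and the sweep at the end are assumptions, not code from the paper; only the numeric values come from the row above.

```python
# Hypothetical sketch (not the authors' code): the experiment configuration
# as summarized in the Experiment Setup row above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentConfig:
    num_layers: int = 4                   # paper sweeps N = 2, 4, 6
    d_model: int = 512                    # standard Transformer baseline width d
    d_inner: int = 2048                   # feed-forward inner dimension, 4 * d
    pre_layer_norm: bool = True           # pre-Layer Norm unless otherwise specified
    e_a: int = 2                          # Catformer parameter-control value reported as e_a (Appendix D)
    e_f: int = 4                          # Catformer parameter-control value reported as e_f (Appendix D)
    optimizer: str = "adam"               # models are trained with Adam
    learning_rates: List[float] = field(  # learning-rate grid from Table 3
        default_factory=lambda: [0.1, 0.2, 0.4])
    noise_stds: List[float] = field(      # noise std. grid from Table 4
        default_factory=lambda: [0.0005, 0.001, 0.002])


# Example: enumerate the reported layer-count sweep.
configs = [ExperimentConfig(num_layers=n) for n in (2, 4, 6)]
```

Collecting the values this way makes the sweep explicit, but it does not recover details the paper leaves unspecified (e.g., software versions or hardware), as noted in the rows above.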