Catformer: Designing Stable Transformers via Sensitivity Analysis
Authors: Jared Q Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Ré, Chelsea Finn, Percy Liang
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab30, a suite of high-dimensional reinforcement tasks, Catformer outperforms other transformers, including Gated Transformer-XL, the state-of-the-art architecture designed to address stability, by 13%. |
| Researcher Affiliation | Collaboration | 1 Stanford University, 2 DeepMind, 3 Google Brain Robotics, 4 Columbia University. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code (e.g., specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | Yes | Finally, we apply the Catformer on visual-navigation tasks from DeepMind Lab (Beattie et al., 2016), a suite of challenging RL tasks with high-dimensional observations, complex action spaces, and partial observability. |
| Dataset Splits | No | Table 2 shows final validation losses for several models from Table 1. However, the paper does not specify the exact percentages or sample counts for train/validation/test splits, nor does it reference predefined splits with explicit citations for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Adam optimizer' but does not specify version numbers for any software components, programming languages, or libraries used in the experiments. |
| Experiment Setup | Yes | We use N = 2, 4, and 6 layer Transformer architectures. All models use pre-Layer Norm unless otherwise specified. For all baselines with constant dimension size in their layers... we use the standard Transformer architecture with d = 512 and inner dimension 4d = 2048... The Catformer model is parameter-controlled with the technique described in Appendix D with ea = 2, ef = 4. Models are trained with the Adam optimizer. Table 3 and Table 4 provide specific learning rates (0.1, 0.2, 0.4) and noise std. (0.0005, 0.001, 0.002) respectively. |
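
The reported setup (N-layer pre-LayerNorm Transformers with d = 512, inner dimension 4d = 2048, trained with Adam) can be sketched as follows. This is a minimal illustration assuming PyTorch and its standard `nn.TransformerEncoder` as a stand-in for the paper's baselines; the helper name `build_baseline` and the head count are assumptions, since the paper does not release code.

```python
# Minimal sketch of the reported experiment setup (not the authors' code).
import torch
import torch.nn as nn

def build_baseline(num_layers: int = 4, d_model: int = 512) -> nn.TransformerEncoder:
    """Standard pre-Layer Norm Transformer baseline with d = 512, inner dim 4d = 2048."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model,
        nhead=8,                      # head count is an assumption; not stated in this summary
        dim_feedforward=4 * d_model,  # inner dimension 4d = 2048, as reported
        norm_first=True,              # pre-Layer Norm, as reported
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)

# Reported sweep values: N = 2, 4, 6 layers; Adam optimizer;
# learning rates {0.1, 0.2, 0.4} and noise std {0.0005, 0.001, 0.002}
# (Tables 3 and 4 of the paper).
model = build_baseline(num_layers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
```

Note that this sketch covers only the baseline configuration described in the table above; the Catformer-specific parameter-control technique (Appendix D, with ea = 2, ef = 4) is not reproduced here.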