Catformer: Designing Stable Transformers via Sensitivity Analysis

Authors: Jared Q Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Ré, Chelsea Finn, Percy Liang

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We prove that Catformers are less sensitive than other Transformer variants and demonstrate that this leads to more stable training. On DMLab30, a suite of high-dimensional reinforcement learning tasks, Catformer outperforms other transformers, including Gated Transformer-XL, the state-of-the-art architecture designed to address stability, by 13%.
Researcher Affiliation | Collaboration | 1 Stanford University, 2 DeepMind, 3 Google Brain Robotics, 4 Columbia University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code (e.g., a specific repository link, an explicit code-release statement, or code in supplementary materials) for the described methodology.
Open Datasets | Yes | Finally, we apply the Catformer to visual-navigation tasks from DeepMind Lab (Beattie et al., 2016), a suite of challenging RL tasks with high-dimensional observations, complex action spaces, and partial observability.
Dataset Splits | No | Table 2 shows final validation losses for several models from Table 1. However, the paper does not specify exact percentages or sample counts for train/validation/test splits, nor does it reference predefined splits with explicit citations for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed machine specifications) used to run its experiments.
Software Dependencies | No | The paper mentions the Adam optimizer but does not specify version numbers for any software components, programming languages, or libraries used in the experiments.
Experiment Setup | Yes | We use N = 2, 4, and 6 layer Transformer architectures. All models use pre-Layer Norm unless otherwise specified. For all baselines with constant dimension size in their layers... we use the standard Transformer architecture with d = 512 and inner dimension 4d = 2048... The Catformer model is parameter-controlled with the technique described in Appendix D, with e_a = 2 and e_f = 4. Models are trained with the Adam optimizer. Tables 3 and 4 provide the specific learning rates (0.1, 0.2, 0.4) and noise standard deviations (0.0005, 0.001, 0.002), respectively.
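For readers who want to restate the reported setup programmatically, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is illustrative only: the class name `ExperimentConfig`, its field names, and the sweep at the end are assumptions, not code from the paper; only the numeric values come from the row above.

```python
# Hypothetical sketch (not the authors' code): the experiment configuration
# as summarized in the Experiment Setup row above.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentConfig:
    num_layers: int = 4                   # paper sweeps N = 2, 4, 6
    d_model: int = 512                    # standard Transformer baseline width d
    d_inner: int = 2048                   # feed-forward inner dimension, 4 * d
    pre_layer_norm: bool = True           # pre-Layer Norm unless otherwise specified
    e_a: int = 2                          # Catformer parameter-control value reported as e_a (Appendix D)
    e_f: int = 4                          # Catformer parameter-control value reported as e_f (Appendix D)
    optimizer: str = "adam"               # models are trained with Adam
    learning_rates: List[float] = field(  # learning-rate grid from Table 3
        default_factory=lambda: [0.1, 0.2, 0.4])
    noise_stds: List[float] = field(      # noise std. grid from Table 4
        default_factory=lambda: [0.0005, 0.001, 0.002])


# Example: enumerate the reported layer-count sweep.
configs = [ExperimentConfig(num_layers=n) for n in (2, 4, 6)]
```

Collecting the values this way makes the sweep explicit, but it does not recover details the paper leaves unspecified (e.g., software versions or hardware), as noted in the rows above.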