How Important Is Weight Symmetry in Backpropagation?
Authors: Qianli Liao, Joel Leibo, Tomaso Poggio
AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using 15 different classification datasets, we systematically investigate to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.'s demonstration (Lillicrap et al. 2014) but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance; (2) the signs of feedback weights do matter: the more concordant the signs between feedforward and their corresponding feedback connections, the better; (3) with feedback weights having random magnitudes and 100% concordant signs, we were able to achieve the same or even better performance than SGD; (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) (Ioffe and Szegedy 2015) and/or a Batch Manhattan (BM) update rule. (A hedged code sketch of such a sign-concordant, asymmetric update appears after this table.) |
| Researcher Affiliation | Academia | Qianli Liao, Joel Z. Leibo, and Tomaso Poggio, Center for Brains, Minds and Machines, McGovern Institute, Massachusetts Institute of Technology, 77 Massachusetts Ave., Cambridge, MA 02139, USA |
| Pseudocode | No | The paper describes algorithms and update rules using mathematical notation and descriptive text (e.g., in Section 2 and 3) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about making its source code publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We extensively test our algorithms on 15 datasets of 5 categories as described below. No data augmentation (e.g., cropping, flip, etc.) is used in any of the experiments. Machine learning tasks: MNIST (LeCun, Cortes, and Burges), CIFAR-10 (Krizhevsky 2009), CIFAR-100 (Krizhevsky 2009), SVHN (Netzer et al. 2011), STL10 (Coates, Ng, and Lee 2011). Standard training and testing splits were used. Basic-level categorization tasks: Caltech101 (Fei-Fei, Fergus, and Perona 2007): 102 classes, 30 training and 10 testing samples per class. Caltech256-101 (Griffin, Holub, and Perona 2007): we train/test on a subset of 102 randomly sampled classes, 30 training and 10 testing per class. iCub World dataset (Fanello et al. 2013): we followed the standard categorization protocol of this dataset. Fine-grained recognition tasks: Flowers17 (Nilsback and Zisserman 2006), Flowers102 (Nilsback and Zisserman 2008). Standard training and testing splits were used. Face identification: Pubfig83-ID (Pinto et al. 2011), SUFR-W-ID (Leibo, Liao, and Poggio 2014), LFW-ID (Huang et al. 2008). We did not follow the usual (verification) protocol of these datasets. Instead, we performed an 80-way face identification task on each dataset, where the 80 identities (IDs) were randomly sampled. Pubfig83: 85 training and 15 testing samples per ID. SUFR-W: 10 training and 5 testing per ID. LFW: 10 training and 5 testing per ID. Scene recognition: MIT-indoor67 (Quattoni and Torralba 2009): 67 classes, 80 training and 20 testing per class. Non-visual task: TIMIT-80 (Garofolo et al.): phoneme recognition using a fully-connected network; 80 classes, 400 training and 100 testing samples per class. |
| Dataset Splits | No | The paper mentions 'The best validation error among all epochs of 5 runs was recorded' and 'Standard training and testing splits were used' for some datasets, and specific training/testing sample counts for others. However, it does not describe how the validation set was constructed, nor does it give complete train/validation/test percentages or sample counts for every dataset, so the partitioning cannot be fully reproduced. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper refers to common algorithms and techniques (e.g., Batch Normalization, SGD) and mentions mini-batch training, but it does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or programming language versions). |
| Experiment Setup | Yes | Momentum was used with hyperparameter 0.9 (a conventional setting). All experiments were run for 65 epochs. The base learning rate was 5 × 10⁻⁴ for epochs 1 to 50, 5 × 10⁻⁵ for epochs 51 to 60, and 5 × 10⁻⁶ for epochs 61 to 65. All models were run 5 times on each dataset with the base learning rate multiplied by 100, 10, 1, 0.1, and 0.01 respectively, because different learning algorithms favor different magnitudes of learning rates. The best validation error among all epochs of the 5 runs was recorded as each model's final performance. The batch sizes were all set to 100 unless stated otherwise. All experiments used a softmax for classification and the cross-entropy loss function. For testing with batch normalization, we compute exponential moving averages (alpha = 0.05) of training means and standard deviations over 20 mini-batches after each training epoch. (A minimal sketch of this schedule and the BN statistics update follows the table.) |
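
To make the asymmetric-BP idea above concrete, here is a minimal, illustrative sketch of a sign-concordant backward pass with a Batch Manhattan style update for a single fully connected layer. This is not the authors' code: the layer sizes, variable names (`W`, `V`, `batch_manhattan`), and the momentum-free form of the BM rule are assumptions made for illustration, and only the ideas of sign-concordant random-magnitude feedback and sign-only updates come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the paper's networks are much larger (hypothetical values).
n_in, n_out, batch = 64, 32, 100
W = 0.1 * rng.standard_normal((n_out, n_in))   # feedforward weights
b = np.zeros(n_out)
# Feedback weights: random magnitudes, but signs copied from W
# ("100% concordant signs" in the paper's terminology).
V = np.sign(W) * np.abs(rng.standard_normal(W.shape))

def forward(x):
    return x @ W.T + b

def backward(x, grad_out):
    """Asymmetric backward pass: the error sent to the layer below goes
    through the sign-concordant feedback matrix V instead of W."""
    grad_W = grad_out.T @ x / x.shape[0]
    grad_b = grad_out.mean(axis=0)
    grad_x = grad_out @ V            # sign-concordant feedback path
    return grad_W, grad_b, grad_x

def batch_manhattan(param, grad, lr):
    """Batch Manhattan style update: keep only the sign of the mini-batch
    gradient (the paper also studies momentum variants of this rule)."""
    return param - lr * np.sign(grad)

# One illustrative step on random data.
x = rng.standard_normal((batch, n_in))
grad_out = rng.standard_normal((batch, n_out))   # stand-in upstream error
gW, gb, _ = backward(x, grad_out)
W = batch_manhattan(W, gW, lr=5e-4)
b = batch_manhattan(b, gb, lr=5e-4)
```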
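
The experiment-setup row can likewise be summarized in code. The sketch below restates the reported learning-rate schedule, multipliers, momentum, batch size, and the exponential-moving-average update of batch-normalization statistics; the function and constant names are hypothetical, and the EMA helper assumes 2-D activations of shape (batch, features).

```python
import numpy as np

def base_lr(epoch):
    """Base learning-rate schedule reported in the paper (65 epochs total)."""
    if epoch <= 50:
        return 5e-4
    if epoch <= 60:
        return 5e-5
    return 5e-6

# Each model was run 5 times with the base rate scaled by one of these factors;
# the best validation error over all epochs and runs was recorded.
LR_MULTIPLIERS = [100, 10, 1, 0.1, 0.01]
MOMENTUM = 0.9
BATCH_SIZE = 100

def bn_running_stats(batches, alpha=0.05):
    """Exponential moving average of BN means/standard deviations over
    ~20 mini-batches, recomputed after each training epoch (alpha = 0.05)."""
    mean = np.zeros(batches[0].shape[1])
    std = np.ones(batches[0].shape[1])
    for x in batches:                       # x has shape (batch, features)
        mean = (1 - alpha) * mean + alpha * x.mean(axis=0)
        std = (1 - alpha) * std + alpha * x.std(axis=0)
    return mean, std

# Example: statistics from 20 random mini-batches of 64 features each.
stats = bn_running_stats([np.random.randn(BATCH_SIZE, 64) for _ in range(20)])
```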