Beyond accuracy: generalization properties of bio-plausible temporal credit assignment rules

Authors: Yuhan Helena Liu, Arna Ghosh, Blake Richards, Eric Shea-Brown, Guillaume Lajoie

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Leveraging results from deep learning theory based on loss landscape curvature, we ask: how do biologically-plausible gradient approximations affect generalization? We first demonstrate that state-of-the-art biologically-plausible learning rules for training RNNs exhibit worse and more variable generalization performance compared to their machine learning counterparts that follow the true gradient more closely. Next, we verify that such generalization performance is correlated significantly with loss landscape curvature, and we show that biologically-plausible learning rules tend to approach high-curvature regions in synaptic weight space. Using tools from dynamical systems, we derive theoretical arguments and present a theorem explaining this phenomenon. This predicts our numerical results, and explains why biologically-plausible rules lead to worse and more variable generalization properties.
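The abstract ties generalization to loss-landscape curvature, typically summarized by the largest Hessian eigenvalue at a trained solution. A minimal sketch of how such a curvature measure can be estimated is given below, using power iteration on Hessian-vector products obtained by finite-differencing the gradient; the quadratic toy loss is a stand-in assumption, not the paper's RNN loss.

```python
import numpy as np

# Hypothetical sketch: estimate loss-landscape curvature as the top
# Hessian eigenvalue via power iteration on Hessian-vector products.
# L(w) = 0.5 * w^T A w is a toy loss whose Hessian is exactly A.

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A @ A.T  # symmetric PSD Hessian for the toy loss

def grad(w):
    # Gradient of the toy quadratic loss L(w) = 0.5 * w^T A w.
    return A @ w

def top_curvature(w, n_iter=500, eps=1e-4):
    # Hessian-vector product via finite differences of the gradient:
    # Hv ~ (grad(w + eps*v) - grad(w)) / eps, then power iteration.
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        hv = (grad(w + eps * v) - grad(w)) / eps
        v = hv / np.linalg.norm(hv)
    # Rayleigh quotient of the converged direction = top eigenvalue.
    return v @ ((grad(w + eps * v) - grad(w)) / eps)

w = rng.standard_normal(10)
est = top_curvature(w)
exact = np.linalg.eigvalsh(A).max()
```

For a real network, `grad` would be replaced by the training-loss gradient at the learned weights; the same power-iteration loop then reports the sharpness measure that the paper correlates with generalization.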
Researcher Affiliation | Collaboration | Yuhan Helena Liu1,2,3,*, Arna Ghosh4,5, Blake A. Richards4,5,6,7, Eric Shea-Brown1,2,3, and Guillaume Lajoie5,7,8,* 1Department of Applied Mathematics, University of Washington, Seattle, WA, USA 2Allen Institute for Brain Science, 615 Westlake Ave N, Seattle, WA, USA 3Computational Neuroscience Center, University of Washington, Seattle, WA, USA 4School of Computer Science, McGill University, Montreal, QC, Canada 5Mila Quebec AI Institute, Montreal, QC, Canada 6Department of Neurology and Neurosurgery, Montreal Neurological Institute, McGill University, Montreal, QC, Canada 7Canada CIFAR AI Chair, CIFAR, Toronto, ON, Canada 8Dept. de Mathématiques et Statistiques, Université de Montréal, Montreal, QC, Canada *Correspondence: hyliu24@uw.edu, g.lajoie@umontreal.ca
Pseudocode | No | The paper describes mathematical equations and theoretical concepts but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | An anonymized code link is provided in Appendix A.3.
Open Datasets | Yes | We performed experiments on three tasks: sequential MNIST [137], pattern generation [138] and delayed match-to-sample tasks [139]. The MNIST database of handwritten digits: http://yann.lecun.com/exdb/mnist/, 1998.
Dataset Splits | No | The paper mentions training and test accuracy, but does not explicitly detail training, validation, and test dataset splits with percentages or sample counts. It refers to Appendix A.3 for training details, but the main text does not contain this specific information.
Hardware Specification | No | The paper states: 'Information pertaining to computing resources and simulation time can be found in Appendix A.3.' However, Appendix A.3 is not provided, and the main text does not contain specific hardware details like GPU/CPU models or memory.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies. It only mentions TensorFlow in the references: '{TensorFlow}: A system for {Large-Scale} machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.'
Experiment Setup | Yes | The detailed governing equations of our setup can be found in Methods (Appendix A). We consider an RNN with $N_{in}$ input units, $N$ hidden units and $N_{out}$ readout units (Figure 1A). We verified that trends hold for different network sizes and refer the reader to Appendix A.3 for more details. The update formula for $h_t \in \mathbb{R}^N$ (the hidden state at time $t$) is governed by: $h_{t+1} = \phi(W_h f(h_t), W_x x_t)$, (1) where $\phi(\cdot): \mathbb{R}^N \to \mathbb{R}^N$ is the hidden state update function, $f(\cdot): \mathbb{R}^N \to \mathbb{R}^N$ is the activation function, $W_h \in \mathbb{R}^{N \times N}$ (resp. $W_x \in \mathbb{R}^{N_{in} \times N}$) is the recurrent (resp. input) weight matrix and $x \in \mathbb{R}^{N_{in}}$ is the input. For $\phi$, we consider a discrete-time implementation of a rate-based recurrent neural network (RNN) similar to the form in [136] (details in Appendix A). Readout $\hat{y} \in \mathbb{R}^{N_{out}}$, with readout weights $w \in \mathbb{R}^{N_{out} \times N}$, is defined as $\hat{y} = \langle w, f(h_t) \rangle$. (2) We performed experiments on three tasks: sequential MNIST [137], pattern generation [138] and delayed match-to-sample tasks [139]. The objective is to minimize scalar loss $L \in \mathbb{R}$, which is defined as ... Different learning algorithms examined in this work are BPTT (our benchmark), which updates weights by computing the exact gradient ($\nabla L(W_h) \in \mathbb{R}^{N \times N}$): $\Delta W_h = -\eta \nabla L(W_h)$, (4) and three SoTA bio-plausible learning rules that update weights using an approximate gradient: $\Delta W_h = -\eta \hat{\nabla} L(W_h)$, (5) where $\hat{\nabla} L(W_h) \in \mathbb{R}^{N \times N}$ denotes a gradient approximation and $\eta \in \mathbb{R}$ denotes the learning rate.
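The setup in Eqs. (1)–(5) can be sketched in a few lines of numpy. The leaky update rule for $\phi$, the tanh activation, the random target, and the one-step truncated gradient standing in for a bio-plausible approximation $\hat{\nabla} L$ are all illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

# Hedged sketch of Eqs. (1)-(5): a discrete-time rate RNN with hidden
# update h_{t+1} = phi(W_h f(h_t), W_x x_t) and readout y = <w, f(h_t)>.
# Choices of phi, f, task, and gradient approximation are assumptions.

rng = np.random.default_rng(1)
N_in, N, N_out, T = 3, 20, 2, 10
alpha, eta = 0.5, 1e-2  # leak rate and learning rate (illustrative)

W_x = rng.standard_normal((N, N_in)) / np.sqrt(N_in)
W_h = rng.standard_normal((N, N)) / np.sqrt(N)
w_out = rng.standard_normal((N_out, N)) / np.sqrt(N)

f = np.tanh  # activation f(.)

def phi(h, rec_in, ext_in):
    # One common discrete-time rate update: a leaky combination of the
    # previous state with recurrent plus external input.
    return (1 - alpha) * h + alpha * (rec_in + ext_in)

def run(x_seq):
    h = np.zeros(N)
    hs = []
    for x_t in x_seq:
        h = phi(h, W_h @ f(h), W_x @ x_t)  # Eq. (1)
        hs.append(h)
    return np.array(hs), w_out @ f(h)      # Eq. (2), readout at final t

x_seq = rng.standard_normal((T, N_in))
y_target = rng.standard_normal(N_out)      # stand-in task target
hs, y_hat = run(x_seq)
loss = 0.5 * np.sum((y_hat - y_target) ** 2)

# Eq. (4) would use the exact gradient dL/dW_h from BPTT. Eq. (5)
# substitutes an approximation; a crude stand-in here is the one-step
# (truncated) gradient through only the final update.
err = w_out.T @ (y_hat - y_target) * (1 - f(hs[-1]) ** 2)
grad_approx = alpha * np.outer(err, f(hs[-2]))
W_h = W_h - eta * grad_approx              # Eq. (5) with approx. gradient
```

The three bio-plausible rules studied in the paper (see Appendix A) differ precisely in how `grad_approx` is constructed; BPTT replaces it with the exact unrolled gradient.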