Trainability, Expressivity and Interpretability in Gated Neural ODEs
Authors: Timothy Doyeon Kim, Tankut Can, Kamesh Krishnamurthy
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using a task that requires memory of continuous quantities, we demonstrate the inductive bias of the gnODEs to learn (approximate) continuous attractors. We further show how reduced-dimensional gnODEs retain their modeling power while greatly improving interpretability, even allowing explicit visualization of the structure of learned attractors. We introduce a novel measure of expressivity which probes the capacity of a neural network to generate complex trajectories. Using this measure, we explore how the phase-space dimension of the nODEs and the complexity of the function modeling the flow field contribute to expressivity. We see that a more complex function for modeling the flow field allows a lower-dimensional nODE to capture a given target dynamics. Finally, we demonstrate the benefit of gating in nODEs on several real-world tasks. |
| Researcher Affiliation | Academia | (1) Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA; (2) School of Natural Sciences, Institute for Advanced Study, Princeton, NJ, USA; (3) Joseph Henry Laboratories of Physics, Princeton University, Princeton, NJ, USA. |
| Pseudocode | No | The paper describes mathematical models and algorithms through equations and text, but it does not provide a clearly labeled pseudocode block or algorithm steps in a structured format. |
| Open Source Code | No | The paper mentions implementing networks with their Julia package, RNNTools.jl, which is based on other libraries. However, it does not explicitly state that RNNTools.jl itself is open-source, nor does it provide a direct link to the source code for the methodology described in the paper. |
| Open Datasets | Yes | This dataset (CharacterTrajectories) is originally from the UEA time series classification archive (Bagnall et al., 2018), and we used the preprocessed data obtained from the Neural CDE repository (see Appendix H.2 and Kidger et al. (2020) for details). The preprocessed data for this task were obtained from the ODE-LSTM repository (see Appendix H.3 and Lechner & Hasani (2020) for details). The dataset is originally from Warden (2018) and preprocessed using the pipeline in the Neural CDE repository (see Appendix H.4 and Kidger et al. (2020) for details). |
| Dataset Splits | Yes | 600 trials were generated total, where 500 trials were used for training and the remaining 100 trials were used for validation. This dataset had 2858 trials total (2000 trials for training, 429 trials for validation and 429 trials for testing)... This dataset had 12,893 trials total (9684 trials for training, 1272 trials for validation and 1937 trials for testing)... There are a total of 34,975 time series (70% training, 15% validation, and 15% test data)... (the quoted split proportions are checked in a short sketch below the table) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU specifications, or cloud computing instance types. |
| Software Dependencies | Yes | All of the networks presented in this work (vanilla RNN, mGRU, GRU, LSTM, LEM, nODE and gnODE) are implemented with our Julia (Bezanson et al., 2017) package, RNNTools.jl. This package is based on Flux.jl, DifferentialEquations.jl and DiffEqFlux.jl (Innes, 2018; Rackauckas & Nie, 2017; Rackauckas et al., 2020). We used Newton's method (implemented in Julia's NLsolve.jl package; Mogensen & Riseth (2018)). |
| Experiment Setup | Yes | To ensure fair comparisons across different networks... for each network, we ran 3 × 3 × 3 = 27 different configurations of (η, λw, B), where η ∈ {10−4, 10−3, 10−2}, λw ∈ {10−3, 10−2, 10−1} and B ∈ {10, 50, 100}. For each network and each configuration, we trained for 600 epochs... The initial states of the networks were not learned, and were initialized with h0 ∼ N(0, Σ), where Σ = 2/(N + 1) I was the variance. We trained our networks for a total of 1300 epochs, where we first trained only the first 14 time-bins for 100 epochs, and then the first 28 time-bins for the next 100 epochs, until we reached 182 time-bins. We performed a grid search over the learning rate η ∈ {10−4, 10−3, 10−2}, rate of weight decay λw ∈ {10−3, 10−2, 10−1}, initialization scheme (Glorot normal, Kaiming normal or the critical initialization proposed in Section 4 and Appendix A; biases were always initialized with a zero-mean Gaussian with variance 10−6), and phase-space dimension N ∈ {32, 100, 316}. (The hyperparameter grid and the time-bin curriculum are sketched below the table.) |
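The split counts quoted in the Dataset Splits row can be sanity-checked with a few lines of arithmetic. The Python sketch below only verifies the quoted totals and proportions; it does not reproduce the authors' preprocessing, and the dictionary labels are placeholders keyed to the quoted totals rather than dataset names from the paper.

```python
# Arithmetic check of the split counts quoted in the Dataset Splits row.
# This does not reproduce the authors' preprocessing; the dict keys are
# placeholder labels keyed to the quoted totals, not names from the paper.
splits = {
    "600-trial task": (500, 100, 0),            # train, validation, test
    "2858-trial dataset": (2000, 429, 429),
    "12,893-trial dataset": (9684, 1272, 1937),
}

for name, (train, val, test) in splits.items():
    total = train + val + test
    fractions = [round(x / total, 3) for x in (train, val, test)]
    print(f"{name}: total={total}, train/val/test fractions={fractions}")

# The 34,975 time series are split 70% / 15% / 15% (train / validation / test).
total = 34_975
print([round(total * p) for p in (0.70, 0.15, 0.15)])
```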
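The hyperparameter grid and the time-bin curriculum quoted in the Experiment Setup row can be written out explicitly. The Python sketch below is a minimal illustration of the quoted settings, not the authors' RNNTools.jl code; the `train` callable is a hypothetical placeholder, and because the quote reports the 27-configuration grid and the 1300-epoch curriculum for different experiments, they are kept separate here.

```python
# Minimal sketch of the hyperparameter grid and time-bin curriculum quoted
# in the Experiment Setup row. NOT the authors' RNNTools.jl code; `train`
# is a hypothetical placeholder callable.
from itertools import product

# Grid search quoted above: 3 x 3 x 3 = 27 configurations per network,
# each trained for 600 epochs.
learning_rates = [1e-4, 1e-3, 1e-2]   # eta
weight_decays = [1e-3, 1e-2, 1e-1]    # lambda_w
batch_sizes = [10, 50, 100]           # B
grid = list(product(learning_rates, weight_decays, batch_sizes))
assert len(grid) == 27

# Curriculum quoted for the 1300-epoch runs: train on the first 14 time-bins
# for 100 epochs, then the first 28 bins for the next 100 epochs, and so on
# until all 182 time-bins are used (13 stages x 100 epochs = 1300 epochs).
curriculum = [(n_bins, 100) for n_bins in range(14, 183, 14)]
assert sum(epochs for _, epochs in curriculum) == 1300


def run_grid(train, epochs=600):
    """Run every (eta, lambda_w, B) configuration; `train` is a placeholder."""
    for eta, lam_w, batch in grid:
        train(lr=eta, weight_decay=lam_w, batch_size=batch, epochs=epochs)
```

The second grid search quoted in the row additionally varies the initialization scheme and the phase-space dimension N ∈ {32, 100, 316}; extending `grid` with those two axes follows the same `itertools.product` pattern.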