Scale-invariant Bayesian Neural Networks with Connectivity Tangent Kernel

Authors: SungYub Kim, Sihwan Park, Kyung-Su Kim, Eunho Yang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTS: Here we describe experiments demonstrating (i) the effectiveness of Connectivity Sharpness (CS) as a generalization measurement metric and (ii) the usefulness of Connectivity Laplace (CL) as a general-purpose Bayesian NN: With CS and CL, we can resolve the contradiction in the FM hypothesis concerning the generalization of NNs and attain stable calibration performance for various ranges of prior scales.
Researcher Affiliation | Collaboration | Sung-Yub Kim (1), Sihwan Park (1), Kyungsu Kim (3,4,5), Eunho Yang (1,2); affiliations: (1) Korea Advanced Institute of Science and Technology (KAIST), (2) AITRICS, (3) Samsung Medical AI Research Center, (4) Sungkyunkwan University School of Medicine, (5) Massachusetts General Hospital and Harvard Medical School
Pseudocode | Yes | In Algorithm 1, we provide a pseudo-code for the RTO implementation of CL. Note that both the time and memory complexity of computing the linearized NN for a mini-batch B are comparable to a forward propagation, as shown in Novak et al. (2022), using the jax.jvp function in JAX (Bradbury et al., 2018). In Algorithm 2, we provide a pseudo-code for the implementation. (A hedged jax.jvp linearization sketch is given after the table.)
Open Source Code | Yes | https://github.com/sungyubkim/connectivity-tangent-kernel
Open Datasets | Yes | We use CIFAR-10 and 100 datasets (Krizhevsky, 2009), where the 50K training instances are randomly partitioned into S_P of cardinality 45K and S_Q of cardinality 5K. [...] UCI regression datasets (Hernández-Lobato & Adams, 2015) and their GAP variants (Foong et al., 2019)
Dataset Splits | Yes | We use CIFAR-10 and 100 datasets (Krizhevsky, 2009), where the 50K training instances are randomly partitioned into S_P of cardinality 45K and S_Q of cardinality 5K. (A minimal index-level split sketch is given after the table.)
Hardware Specification | Yes | For every experiment, we use 8 NVIDIA RTX 3090 GPUs.
Software Dependencies | No | The paper mentions software such as TensorFlow, PyTorch, and JAX and specific functions like jax.jvp, but it does not provide version numbers for any of these components, which are necessary for reproducible dependency information.
Experiment Setup | Yes | We pre-train ResNet-18 (He et al., 2016) with a mini-batch size of 1K on S_P with SGD with an initial learning rate of 0.4 and momentum 0.9. We use cosine annealing for learning rate scheduling (Loshchilov & Hutter, 2016) with a warmup over the initial 10% of training steps. We fix δ = 0.1, α = 0.1, and σ = 1.0 to compute equation 8. Table 6: the hyper-parameter configuration covers network depth (1, 2, 3), network width (32, 64, 128), learning rate (0.1, 0.032, 0.001), weight decay (0.0, 1e-4, 5e-4), and mini-batch size (256, 1024, 4096). We use an SGD optimizer with momentum 0.9, train each model for 200 epochs, and use a cosine learning rate scheduler (Loshchilov & Hutter, 2016) with 30% of the initial epochs as warm-up epochs. (A hedged optax sketch of the warmup + cosine schedule is given after the table.)
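
The Pseudocode row quotes the paper's claim that the linearized NN for a mini-batch can be evaluated with jax.jvp at roughly the cost of a forward pass. The snippet below is a minimal sketch of that idea under our own naming (apply_fn, params0, delta, and the toy network are illustrative), not a reproduction of the paper's Algorithm 1:

```python
import jax
import jax.numpy as jnp

def linearized_apply(apply_fn, params0, delta, x):
    """First-order Taylor expansion of the network around params0:
    f_lin(x) = f(params0, x) + J_f(params0, x) @ delta.
    A single jax.jvp call computes the Jacobian-vector product, so the
    cost is comparable to one forward propagation on the mini-batch x.
    """
    f = lambda p: apply_fn(p, x)
    out, jvp_out = jax.jvp(f, (params0,), (delta,))
    return out + jvp_out

# Toy usage with a one-layer illustrative network (not from the paper):
apply_fn = lambda p, x: jnp.tanh(x @ p["w"])
params0 = {"w": jnp.ones((3, 2))}
delta = {"w": 0.01 * jnp.ones((3, 2))}
x = jnp.ones((5, 3))
y_lin = linearized_apply(apply_fn, params0, delta, x)  # shape (5, 2)
```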
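The Dataset Splits row reports a random partition of the 50K CIFAR training instances into S_P (45K) and S_Q (5K). A minimal index-level sketch of such a split (the seed and the use of jax.random are assumptions; loading CIFAR itself is omitted):

```python
import jax

key = jax.random.PRNGKey(0)                    # seed is an assumption, not from the paper
perm = jax.random.permutation(key, 50_000)     # shuffle the indices of the 50K training set
sp_idx, sq_idx = perm[:45_000], perm[45_000:]  # S_P (45K) / S_Q (5K) index sets
```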
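The Experiment Setup row describes pre-training with SGD (momentum 0.9, initial learning rate 0.4) under cosine annealing with a warmup over the first 10% of training steps. A hedged sketch of that schedule using optax (a library the excerpt does not name; the total step count is a placeholder):

```python
import optax

total_steps = 10_000                   # placeholder; in practice derived from dataset size and epochs
warmup_steps = int(0.1 * total_steps)  # warmup over the initial 10% of training steps

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,           # ramp up from zero
    peak_value=0.4,           # reported initial learning rate for pre-training
    warmup_steps=warmup_steps,
    decay_steps=total_steps,  # cosine-anneal over the full run
)
optimizer = optax.sgd(learning_rate=schedule, momentum=0.9)
```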