On Neural Networks as Infinite Tree-Structured Probabilistic Graphical Models

Authors: Boyao Li, Alexander Thomson, Houssam Nassif, Matthew Engelhard, David Page

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that after training a network quickly using SGD, calibration can be improved by fine-tuning using HMC. The specific HMC algorithm employed here follows directly from the theoretical result, being designed to approximate Gibbs sampling in the theoretical, infinite-width tree-structured Markov network. The degree of approximation is controlled by the value of a single hyperparameter that is also defined based on the theoretical result. (...) Finally, in the context of sigmoid activations we empirically evaluate how the second and third benefits listed above follow from the result, as motivated and summarized now in the next paragraph. (A generic HMC sketch appears after this table.)
Researcher Affiliation | Collaboration | Boyao Li, Department of Biostatistics and Bioinformatics, Duke University, boyao.li@duke.edu; Alexander J. Thomson, Department of Computer Science, Duke University, alexander.thomson@duke.edu; Houssam Nassif, Meta Inc., houssamn@meta.com; Matthew M. Engelhard, Department of Biostatistics and Bioinformatics, Duke University, m.engelhard@duke.edu; David Page, Department of Biostatistics and Bioinformatics, Duke University, david.page@duke.edu
Pseudocode | Yes | Algorithm 1: Step 1 of the PGM Construction (...) Algorithm 2: Step 2 of the PGM Construction (...) Algorithm 3: CD-k Learning for the Deep Belief Network
Open Source Code | Yes | All code needed to reproduce our experimental results may be found at https://github.com/engelhard-lab/DNN_TreePGM.
Open Datasets | Yes | The synthetic datasets are generated by simple BNs and MNs with their weights in different ranges, which are used to define the conditional probability distributions for BNs and potentials for MNs. Each dataset contains 1000 data points {(Xi, yi)}, i = 1, 2, ..., 1000, where each input Xi ∈ {0, 1}^n is a binary vector with n dimensions and each output yi ∈ {0, 1} is a binary value. (...) Similar experiments are also run on the Covertype dataset to compare the calibration of SGD in DNNs, Gibbs, and the HMC-based algorithm. Since the ground truth for the distribution P(y|X) cannot be found, the metric for calibration used in this experiment is the expected calibration error (ECE), which is a common metric for model calibration. To simplify the classification task, we choose the data with labels 1 and 2 and build two binary subsets, each of which contains 1000 data points. (...) Covertype. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50K5N. (A hypothetical synthetic-data sketch appears after this table.)
Dataset Splits | Yes | For all the experiments, the train-test split ratio is 80:20.
Hardware Specification | No | An internal cluster of GPUs was employed for all experiments, and some of the run-times are provided in the appendix; as anticipated, SGD is faster than HMC, which is faster than Gibbs. (No specific GPU model or detailed cluster specs are provided beyond 'GPUs' and 'internal cluster'.)
Software Dependencies | No | Adam optimizer is used with learning rate being 1 × 10^-4. (No specific version numbers are provided for any software components like Adam, Python, or relevant libraries.)
Experiment Setup | Yes | For all the experiments, the train-test split ratio is 80:20. For the training and fine-tuning, the Adam optimizer is used with a learning rate of 1 × 10^-4. To get the predicted probabilities for the fine-tuned network, 1000 output probabilities are sampled and averaged. In synthetic experiments, both BNs and MNs have a structure with input dimension 4, two latent layers with 4 nodes each, and one binary output. (...) Here L defines the normal distribution for hidden nodes in Eqn. 1 and is explored across the set of values {10, 100, 1000}. (...) The number of training epochs is 100 or 1000, while the fine-tuning epochs shown in Table 2 is 20. (A minimal training-setup sketch appears after this table.)
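
The following sketch illustrates the synthetic-data setup quoted in the Open Datasets and Dataset Splits rows: 1000 binary pairs (Xi, yi) with 4-dimensional binary inputs and a single binary label, followed by an 80:20 train-test split. The generating model below (independent Bernoulli inputs and a logistic conditional with random weights) is a hypothetical stand-in for the paper's BNs and MNs, whose exact structures and weight ranges are not reproduced here.

```python
# Hypothetical synthetic-data sketch; the generating network is illustrative,
# not the BN/MN used in the paper.
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 4, 1000

# Binary inputs: independent Bernoulli(0.5) variables (assumption).
X = rng.integers(0, 2, size=(n_samples, n))

# Illustrative conditional P(y = 1 | X): a logistic function of a random
# weight vector, standing in for the paper's BN/MN-defined conditionals.
w = rng.uniform(-1.0, 1.0, size=n)
p_y = 1.0 / (1.0 + np.exp(-(X @ w)))
y = rng.binomial(1, p_y)

# 80:20 train-test split, as reported.
perm = rng.permutation(n_samples)
split = int(0.8 * n_samples)
X_train, y_train = X[perm[:split]], y[perm[:split]]
X_test, y_test = X[perm[split:]], y[perm[split:]]
```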
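A minimal sketch of the reported training setup, assuming PyTorch: a network with input dimension 4, two latent layers of 4 nodes, one binary output, and the Adam optimizer with learning rate 1e-4. Sigmoid activations and a binary cross-entropy loss are assumptions suggested by the paper's focus on sigmoid activations; the HMC/Gibbs fine-tuning itself is not implemented here, only the averaging of 1000 sampled output probabilities at prediction time.

```python
import torch
import torch.nn as nn

def build_model():
    # 4-dimensional input, two latent layers of 4 nodes, one binary output.
    return nn.Sequential(
        nn.Linear(4, 4), nn.Sigmoid(),
        nn.Linear(4, 4), nn.Sigmoid(),
        nn.Linear(4, 1),  # output logit; sigmoid applied at prediction time
    )

def train(model, X, y, epochs=1000, lr=1e-4):
    # Adam with learning rate 1e-4; 100 or 1000 epochs as reported.
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.float32).unsqueeze(1)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return model

def predict_proba(sample_logits, X, n_samples=1000):
    # Average 1000 sampled output probabilities, as described for the
    # fine-tuned network; `sample_logits` stands in for one stochastic
    # forward pass drawn from the fine-tuned model.
    X = torch.as_tensor(X, dtype=torch.float32)
    with torch.no_grad():
        probs = torch.stack(
            [torch.sigmoid(sample_logits(X)) for _ in range(n_samples)]
        )
    return probs.mean(dim=0)
```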
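For orientation only, here is a generic Hamiltonian Monte Carlo transition over a flattened parameter vector, matching the SGD-then-HMC workflow quoted in the Research Type row. It is not the paper's algorithm: the authors' HMC variant is constructed to approximate Gibbs sampling in the infinite-width tree-structured Markov network and is governed by an additional hyperparameter, none of which is modeled below. The `log_prob` argument is a user-supplied (unnormalized) log-posterior and is assumed for illustration.

```python
import torch

def hmc_step(theta, log_prob, step_size=1e-3, n_leapfrog=20):
    """One generic HMC transition targeting exp(log_prob(theta))."""

    def grad_log_prob(t):
        t = t.detach().requires_grad_(True)
        lp = log_prob(t)
        (g,) = torch.autograd.grad(lp, t)
        return lp.detach(), g

    theta = theta.detach()
    momentum = torch.randn_like(theta)
    lp0, g = grad_log_prob(theta)
    current_h = lp0 - 0.5 * momentum.pow(2).sum()   # negative Hamiltonian

    # Leapfrog integration of the Hamiltonian dynamics.
    t_new, p_new = theta.clone(), momentum + 0.5 * step_size * g
    for i in range(n_leapfrog):
        t_new = t_new + step_size * p_new
        lp_new, g = grad_log_prob(t_new)
        if i < n_leapfrog - 1:
            p_new = p_new + step_size * g
    p_new = p_new + 0.5 * step_size * g
    proposed_h = lp_new - 0.5 * p_new.pow(2).sum()

    # Metropolis correction: accept the proposal or keep the current state.
    if torch.rand(()) < torch.exp(proposed_h - current_h):
        return t_new
    return theta
```

In this sketch, a flattened parameter vector could be obtained after SGD training with torch.nn.utils.parameters_to_vector(model.parameters()), and repeated calls to hmc_step would then play the role of the fine-tuning phase.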