Efficient Uncertainty Quantification and Reduction for Over-Parameterized Neural Networks

Authors: Ziyi Huang, Henry Lam, Haofeng Zhang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct numerical experiments to demonstrate the effectiveness of our approaches. Our proposed approaches are evaluated on the following two tasks: 1) construct confidence intervals and 2) reduce procedural variability to improve prediction. With a known ground-truth regression function, training data are regenerated from the underlying synthetic data generative process. According to the NTK parameterization in Section C, our base network is formed with two fully connected layers with 32n neurons in each hidden layer to ensure the network is sufficiently wide and over-parameterized. Detailed optimization specifications are described in Proposition C.3. Our synthetic datasets #1 are generated with the following distributions: X ∼ Unif([0, 0.2]^d) and Y = Σ_{i=1}^d sin(X^{(i)}) + N(0, 0.001^2). The training set D = {(x_i, y_i) : i = 1, ..., n} is formed by drawing i.i.d. samples of (X, Y) from the above distribution with sample size n. We consider multiple dimension settings d = 2, 4, 8, 16 and data size settings n = 128, 256, 512, 1024 to study the effects of different dimensionalities and data sizes. Additional experimental results on more datasets are presented in Appendix F. The implementation details of our experiments are also provided in Appendix F.
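As a quick illustration of the data-generating process quoted above, the following sketch regenerates synthetic dataset #1. The helper name make_synthetic_dataset, the seed, and the particular (d, n) choice in the usage line are illustrative assumptions, not taken from the authors' repository.

import numpy as np

def make_synthetic_dataset(n, d, noise_std=0.001, seed=0):
    """Draw n i.i.d. samples with X ~ Unif([0, 0.2]^d) and
    Y = sum_i sin(X^(i)) + N(0, noise_std^2)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 0.2, size=(n, d))
    y = np.sin(X).sum(axis=1) + rng.normal(0.0, noise_std, size=n)
    return X, y

# Example: one of the settings reported above (d = 8, n = 512).
X_train, y_train = make_synthetic_dataset(n=512, d=8)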
Researcher Affiliation | Academia | Ziyi Huang, Henry Lam, Haofeng Zhang; Columbia University, New York, NY, USA; {zh2354, khl2114, hz2553}@columbia.edu
Pseudocode | Yes | Algorithm 1 Procedural-Noise-Correcting (PNC) Predictor
Input: Training data D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.
Procedure:
1. Draw θ^b ∼ P_{θ^b} = N(0, I_p) under NTK parameterization. Train a standard base network with data D and the initialization parameters θ^b, which outputs ĥ_{n,θ^b}(·) in (3).
2. Let s̄(x) = E_{P_{θ^b}}[s_{θ^b}(x)]. For each x_i in D, generate its "artificial" label s̄(x_i) = E_{P_{θ^b}}[s_{θ^b}(x_i)]. Train an auxiliary neural network with data {(x_1, s̄(x_1)), (x_2, s̄(x_2)), ..., (x_n, s̄(x_n))} and the initialization parameters θ^b (the same one as in Step 1), which outputs φ̂*_{n,θ^b}(·). Subtracting s̄(·), we obtain φ̂_{n,θ^b}(·) = φ̂*_{n,θ^b}(·) − s̄(·).
Output: At point x, output ĥ_{n,θ^b}(x) − φ̂_{n,θ^b}(x).
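To make the quoted pseudocode concrete, here is a minimal Python/PyTorch sketch of the PNC predictor, assuming a simple two-layer ReLU network trained by full-batch gradient descent. The names BaseNet, train, and pnc_predict, the hyperparameter defaults, and the Monte Carlo approximation of s̄ by averaging fresh initializations are illustrative assumptions of this sketch, not the authors' implementation.

import copy
import torch
import torch.nn as nn

class BaseNet(nn.Module):
    """Two-layer ReLU network; the width should be large (e.g., 32n as in the setup) for the NTK regime."""
    def __init__(self, d, width):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):
        return self.f(x).squeeze(-1)

def train(net, X, y, lr=1e-3, epochs=2000, lam=0.0):
    """Full-batch gradient descent on the (optionally regularized) square loss."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((net(X) - y) ** 2).mean()
        if lam > 0:
            loss = loss + lam * sum((p ** 2).sum() for p in net.parameters())
        loss.backward()
        opt.step()
    return net

def pnc_predict(X, y, x_query, d, width, n_init_samples=32):
    # Step 1: draw one initialization theta^b and train the base network h_hat.
    init = BaseNet(d, width)
    base = train(copy.deepcopy(init), X, y)
    # Step 2: artificial labels s_bar(x_i) = E[s_{theta^b}(x_i)], the expected
    # network output at initialization (often 0 by symmetry). This sketch
    # approximates the expectation by averaging a few fresh initializations --
    # an assumption, not the paper's prescription.
    with torch.no_grad():
        s_bar = torch.stack([BaseNet(d, width)(X) for _ in range(n_init_samples)]).mean(0)
    aux = train(copy.deepcopy(init), X, s_bar)  # phi_hat^*, trained from the same theta^b
    with torch.no_grad():
        s_bar_q = torch.stack([BaseNet(d, width)(x_query) for _ in range(n_init_samples)]).mean(0)
        phi_hat = aux(x_query) - s_bar_q        # phi_hat = phi_hat^* - s_bar
        return base(x_query) - phi_hat          # PNC output: h_hat - phi_hat

With the synthetic-data sketch above, X and y can be passed in as float32 torch tensors and x_query as a (1, d) tensor.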
Open Source Code | Yes | The source code for experiments is available at https://github.com/HZ0000/UQforNN.
Open Datasets | No | The paper mentions synthetic datasets #1 and #2 (e.g., "X ∼ Unif([0, 0.2]^d) and Y = Σ_{i=1}^d sin(X^{(i)}) + N(0, 0.001^2)") and real-world UCI datasets (Boston, Concrete, Energy), but does not provide explicit links, DOIs, or formal citations for public access to these datasets beyond their names.
Dataset Splits | No | The paper mentions training data sizes n = 128, 256, 512, 1024 and a test point x_0, but does not specify train/validation/test splits, percentages, or cross-validation strategies in the main text. It mentions an "80%/20% random split for training data and test data in the original dataset" for real-world datasets in Appendix F.2, but does not specify how validation sets are handled, if any.
Hardware Specification | Yes | All experiments are conducted on a single GeForce RTX 2080 Ti GPU.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as Python, PyTorch, or other libraries. It mentions the ReLU activation function but gives no framework or library versions.
Experiment Setup | Yes | The network has 32n hidden neurons in its hidden layer, where n is the size of the entire training data. The network should be sufficiently wide so that the NTK theory holds. [...] The network is trained using the regularized square loss (1) with regularization hyperparameter λ_n = 0.1^10. [...] The network is trained using (full) batch gradient descent (by feeding the whole dataset). [...] The learning rate and training epochs are properly tuned based on the specific dataset. The number of epochs should not be too small since the training needs to converge to a good solution, but the learning rate should also not be too large because we need to stipulate that the training procedure indeed operates in the NTK regime (an area around the initialization). Note that in practice, we cannot use the continuous-time gradient flow, and the network can never be infinitely wide. Therefore, with the fixed learning rate in gradient descent, we do not greatly increase the number of epochs so that the training procedure will likely find a solution that is not too far from the initialization. [...] We set m = 4 in the PNC-enhanced batching approach and R = 4 in the PNC-enhanced cheap bootstrap approach.
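For the confidence-interval step, a hedged sketch of the PNC-enhanced batching interval with m = 4 follows. It reuses the pnc_predict helper from the earlier sketch and assumes the standard batching form mean ± t_{m-1, 1-α/2} · s / sqrt(m); the function name pnc_batching_interval and the splitting logic are illustrative, and the paper's exact interval statement should be taken from the paper itself.

import numpy as np
import torch
from scipy import stats

def pnc_batching_interval(X, y, x0, d, width, m=4, alpha=0.05):
    """Split the data into m batches, run the PNC predictor on each batch at x0,
    and form a t-interval with m - 1 degrees of freedom."""
    n = X.shape[0]
    batches = np.array_split(np.random.permutation(n), m)  # m disjoint batches
    preds = np.array([
        pnc_predict(X[torch.as_tensor(b)], y[torch.as_tensor(b)], x0, d, width).item()
        for b in batches
    ])
    center = preds.mean()
    half = stats.t.ppf(1 - alpha / 2, df=m - 1) * preds.std(ddof=1) / np.sqrt(m)
    return center - half, center + half

The R = 4 cheap-bootstrap variant would instead retrain on R resampled training sets rather than disjoint batches; see the paper for its exact interval construction.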