PAGER: Accurate Failure Characterization in Deep Regression Models
Authors: Jayaraman J. Thiagarajan, Vivek Narayanaswamy, Puja Trivedi, Rushil Anirudh
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results reveal that, when compared to state-of-the-art detectors, the risk regimes identified by PAGER align best with the true risk. |
| Researcher Affiliation | Collaboration | 1Lawrence Livermore National Labs, CA, USA 2University of Michigan, USA 3Amazon, CA, USA. |
| Pseudocode | Yes | Algorithms 1, 2, and 3 provide the details for estimating the predictive uncertainty and the non-conformity scores Score1 and Score2, respectively, in PAGER. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code for the methodology described. |
| Open Datasets | Yes | 1D Benchmark Functions: (a) f1(x) = x² if x < 2.25 or x > 3.01, and x² − 20 otherwise (Figure 1); (b) f2(x) = sin(2πx), x ∈ [−0.5, 2.5]; (c) f3(x) = a exp(−bx) + exp(cos(cx)) − a − exp(1), x ∈ [−5, 5], a = 20, b = 0.2, c = 2π; (d) f4(x) = sin(x) cos(5x) cos(22x), x ∈ [−1, 2]. HD Regression Benchmarks: (a) Camel (2D) and (b) Levy (2D), characterized by multiple local minima; (c) Airfoil (5D); (d) NO2 (7D); (e) Kinematics (8D) and (f) Puma (8D), simulated datasets of the forward dynamics of different robotic control arms; (g) Boston Housing (13D); (h) Ailerons (39D), a dataset for predicting the control action of the ailerons of an F16 aircraft; and (i) Drug-Target Interactions (32000D). Image Regression: three benchmarks, namely chair (yaw) angle, cell count, and CIFAR-10 rotation prediction. Dataset sources: Ailerons datasets, https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html (accessed 2023-05-11); Boston Housing, https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.load_boston.html (accessed 2023-05-11); Delve datasets, https://www.cs.toronto.edu/~delve/data/datasets.html (accessed 2023-05-11). (A NumPy sketch of f1–f4 follows the table.) |
| Dataset Splits | Yes | For evaluation, we used the held-out test sets (e.g., 10K randomly rotated images for CIFAR-10). We computed the test performance (R² statistic) in both observed (range of y values exposed during training) and unobserved (range of y values unseen during training) regimes for the three image regression benchmarks. (A regime-split R² sketch follows the table.) |
| Hardware Specification | No | The paper mentions "measured using a test set of 1000 samples on the 1D benchmarks with a single GPU" but does not specify the model of the GPU or any other hardware components. |
| Software Dependencies | No | The paper mentions various models and optimizers like "MLP", "WideResNet-40-2", "ResNet-34", and "Adam optimizer", but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For experiments on all tabular benchmarks, we used an MLP (Bishop & Nasrabadi, 2007) with 4 layers, each with a hidden dimension of 128. While we used the WideResNet-40-2 model (Zagoruyko & Komodakis, 2016) for the first two image regression datasets, in the case of CIFAR-10, we randomly applied a rotation transformation ([0, 90] degrees) to each 32×32×3 image and trained a ResNet-34 model to predict the angle of rotation. The training is performed with a batch size of 128 for 100 epochs. We utilize the Adam optimizer with momentum parameters of (0.9, 0.999) and a fixed learning rate of 1e−4. (A PyTorch sketch of this configuration follows the table.) |
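The 1D benchmark functions quoted in the "Open Datasets" row are simple enough to restate in code. Below is a minimal NumPy sketch of f1–f4 as reconstructed above; the PDF extraction dropped minus signs, so the −20 offset in f1 and the sign placement in f3 (chosen so that f3(0) = 0, matching the standard Ackley form) are assumptions rather than verbatim transcriptions.

```python
import numpy as np

A, B, C = 20.0, 0.2, 2 * np.pi  # constants quoted for f3

def f1(x):
    # Quadratic with a depressed band on [2.25, 3.01] (Figure 1); the "- 20"
    # offset is an assumption, since the extraction dropped minus signs.
    x = np.asarray(x, dtype=float)
    return np.where((x < 2.25) | (x > 3.01), x**2, x**2 - 20.0)

def f2(x):
    # Sine wave on x in [-0.5, 2.5].
    return np.sin(2 * np.pi * np.asarray(x, dtype=float))

def f3(x):
    # Ackley-style function on x in [-5, 5]; sign placement assumed so that
    # f3(0) = 0, consistent with the standard Ackley benchmark.
    x = np.asarray(x, dtype=float)
    return A * np.exp(-B * np.abs(x)) + np.exp(np.cos(C * x)) - A - np.e

def f4(x):
    # Product of sinusoids on x in [-1, 2].
    x = np.asarray(x, dtype=float)
    return np.sin(x) * np.cos(5 * x) * np.cos(22 * x)
```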
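The observed/unobserved evaluation described in the "Dataset Splits" row can be expressed as a small helper. This is a sketch under the assumption that the two regimes are separated purely by the target range seen during training; the name regime_r2 and the threshold arguments are hypothetical, not from the paper.

```python
import numpy as np
from sklearn.metrics import r2_score

def regime_r2(y_true, y_pred, train_lo, train_hi):
    """R^2 computed separately on the observed regime (targets inside the
    range exposed during training) and the unobserved regime (targets
    outside it). train_lo / train_hi are hypothetical placeholders for
    whatever target range the training set covered."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    observed = (y_true >= train_lo) & (y_true <= train_hi)
    return {
        "observed": r2_score(y_true[observed], y_pred[observed]),
        "unobserved": r2_score(y_true[~observed], y_pred[~observed]),
    }
```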
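For the "Experiment Setup" row, here is a minimal PyTorch sketch of the stated tabular-benchmark configuration: a 4-layer MLP with hidden dimension 128, Adam with momentum parameters (0.9, 0.999), a fixed learning rate of 1e−4, batch size 128, and 100 epochs. The ReLU activations, MSE loss, and single-output regression head are assumptions not specified in the extract.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim: int) -> nn.Sequential:
    # 4 layers with hidden width 128, per the quoted setup; ReLU and the
    # scalar regression head are assumptions.
    return nn.Sequential(
        nn.Linear(in_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )

model = make_mlp(in_dim=13)  # e.g., Boston Housing (13D)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
criterion = nn.MSELoss()  # loss function is an assumption

def train(loader, epochs: int = 100):
    # Batch size 128 is set in the DataLoader (not shown); 100 epochs as quoted.
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
```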