PAGER: Accurate Failure Characterization in Deep Regression Models
Authors: Jayaraman J. Thiagarajan, Vivek Narayanaswamy, Puja Trivedi, Rushil Anirudh
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results reveal that, when compared to state-of-the-art detectors, the risk regimes identified by PAGER align best with the true risk. |
| Researcher Affiliation | Collaboration | 1Lawrence Livermore National Labs, CA, USA 2University of Michigan, USA 3Amazon, CA, USA. |
| Pseudocode | Yes | Algorithms 1, 2, and 3 provide the details for estimating the predictive uncertainty and the non-conformity scores Score1 and Score2, respectively, in PAGER. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code for the methodology described. |
| Open Datasets | Yes | 1D Benchmark Functions: (a) f1(x) = x² if x < 2.25 or x > 3.01, and x² − 20 otherwise (Figure 1); (b) f2(x) = sin(2πx), x ∈ [−0.5, 2.5]; (c) f3(x) = a exp(−bx) + exp(cos(cx)) − a − exp(1), x ∈ [−5, 5], a = 20, b = 0.2, c = 2π; (d) f4(x) = sin(x) cos(5x) cos(22x), x ∈ [−1, 2]. HD Regression Benchmarks: (a) Camel (2D) and (b) Levy (2D), characterized by multiple local minima; (c) Airfoil (5D); (d) NO2 (7D); (e) Kinematics (8D) and (f) Puma (8D), simulated datasets of the forward dynamics of different robotic control arms; (g) Boston Housing (13D); (h) Ailerons (39D), a dataset for predicting the control action of the ailerons of an F16 aircraft; and (i) Drug-Target Interactions (32000D). Image Regression: three benchmarks, namely chair (yaw) angle, cell count, and CIFAR-10 rotation prediction. Dataset sources: Ailerons datasets, https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html (accessed 2023-05-11); Boston Housing, https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.load_boston.html (accessed 2023-05-11); Delve datasets, https://www.cs.toronto.edu/~delve/data/datasets.html (accessed 2023-05-11). (A NumPy sketch of f1–f4 follows the table.) |
| Dataset Splits | Yes | For evaluation, we used the held-out test sets (e.g., 10K randomly rotated images for CIFAR-10). We computed the test performance (R² statistic) in both observed (range of y values exposed during training) and unobserved (range of y values unseen during training) regimes for the three image regression benchmarks. (A regime-split R² sketch follows the table.) |
| Hardware Specification | No | The paper mentions "measured using a test set of 1000 samples on the 1D benchmarks with a single GPU" but does not specify the model of the GPU or any other hardware components. |
| Software Dependencies | No | The paper mentions various models and optimizers like "MLP", "WideResNet-40-2", "ResNet-34", and "Adam optimizer", but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For experiments on all tabular benchmarks, we used an MLP (Bishop & Nasrabadi, 2007) with 4 layers, each with a hidden dimension of 128. While we used the WideResNet-40-2 model (Zagoruyko & Komodakis, 2016) for the first two image regression datasets, in the case of CIFAR-10, we randomly applied a rotation transformation ([0, 90] degrees) to each 32×32×3 image and trained a ResNet-34 model to predict the angle of rotation. The training is performed with a batch size of 128 for 100 epochs. We utilize the Adam optimizer with momentum parameters of (0.9, 0.999) and a fixed learning rate of 1e−4. (A PyTorch sketch of this configuration follows the table.) |
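The 1D benchmark functions quoted in the "Open Datasets" row are simple enough to restate in code. Below is a minimal NumPy sketch of f1–f4 as reconstructed above; the PDF extraction dropped minus signs, so the −20 offset in f1 and the sign placement in f3 (chosen so that f3(0) = 0, matching the standard Ackley form) are assumptions rather than verbatim transcriptions.

```python
import numpy as np

A, B, C = 20.0, 0.2, 2 * np.pi  # constants quoted for f3

def f1(x):
    # Quadratic with a depressed band on [2.25, 3.01] (Figure 1); the "- 20"
    # offset is an assumption, since the extraction dropped minus signs.
    x = np.asarray(x, dtype=float)
    return np.where((x < 2.25) | (x > 3.01), x**2, x**2 - 20.0)

def f2(x):
    # Sine wave on x in [-0.5, 2.5].
    return np.sin(2 * np.pi * np.asarray(x, dtype=float))

def f3(x):
    # Ackley-style function on x in [-5, 5]; sign placement assumed so that
    # f3(0) = 0, consistent with the standard Ackley benchmark.
    x = np.asarray(x, dtype=float)
    return A * np.exp(-B * np.abs(x)) + np.exp(np.cos(C * x)) - A - np.e

def f4(x):
    # Product of sinusoids on x in [-1, 2].
    x = np.asarray(x, dtype=float)
    return np.sin(x) * np.cos(5 * x) * np.cos(22 * x)
```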
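The observed/unobserved evaluation described in the "Dataset Splits" row can be expressed as a small helper. This is a sketch under the assumption that the two regimes are separated purely by the target range seen during training; the name regime_r2 and the threshold arguments are hypothetical, not from the paper.

```python
import numpy as np
from sklearn.metrics import r2_score

def regime_r2(y_true, y_pred, train_lo, train_hi):
    """R^2 computed separately on the observed regime (targets inside the
    range exposed during training) and the unobserved regime (targets
    outside it). train_lo / train_hi are hypothetical placeholders for
    whatever target range the training set covered."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    observed = (y_true >= train_lo) & (y_true <= train_hi)
    return {
        "observed": r2_score(y_true[observed], y_pred[observed]),
        "unobserved": r2_score(y_true[~observed], y_pred[~observed]),
    }
```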
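For the "Experiment Setup" row, here is a minimal PyTorch sketch of the stated tabular-benchmark configuration: a 4-layer MLP with hidden dimension 128, Adam with momentum parameters (0.9, 0.999), a fixed learning rate of 1e−4, batch size 128, and 100 epochs. The ReLU activations, MSE loss, and single-output regression head are assumptions not specified in the extract.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim: int) -> nn.Sequential:
    # 4 layers with hidden width 128, per the quoted setup; ReLU and the
    # scalar regression head are assumptions.
    return nn.Sequential(
        nn.Linear(in_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )

model = make_mlp(in_dim=13)  # e.g., Boston Housing (13D)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
criterion = nn.MSELoss()  # loss function is an assumption

def train(loader, epochs: int = 100):
    # Batch size 128 is set in the DataLoader (not shown); 100 epochs as quoted.
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
```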