TRIAGE: Characterizing and auditing training data for improved regression

Authors: Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar

NeurIPS 2023

Reproducibility audit: each variable below lists the audit result, followed by the supporting LLM response.

Research Type: Experimental
LLM Response: Empirically, we demonstrate the utility of TRIAGE across multiple use cases satisfying P1-P4, including consistent characterization, sculpting to improve performance in a variety of settings, as well as guiding dataset selection and feature acquisition. Datasets. We conduct experiments on 10 real-world regression datasets with varying characteristics.

Researcher Affiliation: Academia
LLM Response: Nabeel Seedat, University of Cambridge (ns741@cam.ac.uk); Jonathan Crabbé, University of Cambridge (jc2133@cam.ac.uk); Zhaozhi Qian, University of Cambridge (zq224@cam.ac.uk); Mihaela van der Schaar, University of Cambridge (mv472@cam.ac.uk)

Pseudocode: Yes
LLM Response: Algorithm 1: Computing a CPD. (A minimal illustrative sketch follows this entry.)

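The paper's Algorithm 1 computes a CPD (conformal predictive distribution). As a rough illustration only, here is a minimal split conformal predictive system in Python; the model choice, function names, and calibration scheme are assumptions made for this sketch, not the TRIAGE implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_cpd(X_train, y_train, X_cal, y_cal):
    """Return Q(y | x): a split conformal predictive distribution
    estimated from calibration residuals (illustrative only)."""
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    residuals = np.sort(y_cal - model.predict(X_cal))  # sorted calibration scores

    def cpd(x, y):
        # Q(y | x): rank of (y - y_hat(x)) among the calibration residuals,
        # i.e. an estimate of P(target <= y | x).
        y_hat = model.predict(np.asarray(x).reshape(1, -1))[0]
        rank = np.searchsorted(residuals, y - y_hat, side="right")
        return rank / (len(residuals) + 1)

    return cpd
```
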
Open Source Code: Yes
LLM Response: Code: https://github.com/seedatnabeel/TRIAGE or https://github.com/vanderschaarlab/TRIAGE

Open Datasets: Yes
LLM Response: The datasets are drawn from diverse domains, including safety-critical medical regression: (i) prostate cancer from the US [31] and UK [32], (ii) Hospital Length of Stay [33], and (iii) MIMIC Antibiotics [34]. Additionally, we analyze general UCI regression datasets [35], including Bike, Boston Housing, Bio, Concrete, Protein, and Star. The datasets are detailed in Appendix B, along with further experimental details.

Dataset Splits: Yes
LLM Response: We partitioned each dataset into training, validation, and testing sets using an 80:10:10 split. All experiments are repeated 5 times with different random seeds for consistency, and the average and standard deviation are reported. (A split sketch follows this entry.)

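As a minimal sketch of the protocol quoted above (80:10:10 split, 5 seeds, mean and standard deviation reported), assuming scikit-learn and stand-in synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)  # stand-in data

scores = []
for seed in range(5):  # 5 repetitions with different random seeds
    # 80% train; the remaining 20% is split evenly into validation and test
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)
    # ... train on (X_train, y_train), tune on (X_val, y_val), evaluate on
    # (X_test, y_test), and append the chosen metric to `scores`

# np.mean(scores) and np.std(scores) give the reported average and std. dev.
```
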
Hardware Specification: No
LLM Response: The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.

Software Dependencies: No
LLM Response: The paper mentions the types of models used (e.g., neural networks, XGBoost) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions).

Experiment Setup: Yes
LLM Response: Neural networks (NNs) are built using a 3-layer multi-layer perceptron (MLP) with 100 hidden units per layer and ReLU activations, trained for 100 epochs using the Adam optimizer with a learning rate of 1e-3 and a batch size of 128. Early stopping is used with a patience of 10 epochs. XGBoost models are trained with 1000 estimators, a learning rate of 0.1, and an early stopping patience of 10 rounds. (A configuration sketch follows this entry.)

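A sketch of the quoted configurations, assuming PyTorch and the xgboost scikit-learn API; the input dimension and the reading of "3-layer" as three Linear layers are assumptions:

```python
import torch
import torch.nn as nn
from xgboost import XGBRegressor

n_features = 10  # placeholder input dimension; not specified in the quote above

# MLP with 100 hidden units per layer and ReLU activations
# (read here as three Linear layers; the paper may count layers differently)
mlp = nn.Sequential(
    nn.Linear(n_features, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)  # Adam, lr = 1e-3
# Train for up to 100 epochs with batch size 128, stopping early if the
# validation loss fails to improve for 10 consecutive epochs (loop omitted).

# XGBoost: 1000 estimators, learning rate 0.1, early stopping after 10 rounds.
# early_stopping_rounds is a constructor argument in xgboost >= 1.6; older
# versions pass it to fit() instead.
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.1, early_stopping_rounds=10)
# xgb.fit(X_train, y_train, eval_set=[(X_val, y_val)])  # needs a validation set
```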