TRIAGE: Characterizing and auditing training data for improved regression
Authors: Nabeel Seedat, Jonathan Crabbé, Zhaozhi Qian, Mihaela van der Schaar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate the utility of TRIAGE across multiple use cases satisfying P1-P4, including consistent characterization, sculpting to improve performance in a variety of settings, as well as guiding dataset selection and feature acquisition. Datasets. We conduct experiments on 10 real-world regression datasets with varying characteristics. |
| Researcher Affiliation | Academia | Nabeel Seedat University of Cambridge ns741@cam.ac.uk Jonathan Crabbé University of Cambridge jc2133@cam.ac.uk Zhaozhi Qian University of Cambridge zq224@cam.ac.uk Mihaela van der Schaar University of Cambridge mv472@cam.ac.uk |
| Pseudocode | Yes | Algorithm 1 Computing a CPD (an illustrative CPD sketch appears below the table) |
| Open Source Code | Yes | Code: https://github.com/seedatnabeel/TRIAGE or https://github.com/vanderschaarlab/TRIAGE |
| Open Datasets | Yes | The datasets are drawn from diverse domains, including safety-critical medical regression: (i) Prostate cancer from the US [31] and UK [32], (ii) Hospital Length of Stay [33] and (iii) MIMIC Antibiotics [34]. Additionally, we analyze general UCI regression datasets [35], including Bike, Boston Housing, Bio, Concrete, Protein and Star. The datasets are detailed in Appendix B, along with further experimental details. |
| Dataset Splits | Yes | We partitioned each dataset into training, validation, and testing sets using an 80:10:10 split. All experiments are repeated 5 times, with different random seeds for consistency, and the average and standard deviation are reported. (A split-protocol sketch appears below the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions types of models used (e.g., Neural Networks, XGBoost) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | Neural Networks (NNs) are built using a 3-layer Multi-Layer Perceptron (MLP) with 100 hidden units per layer, ReLU activation, and trained for 100 epochs using the Adam optimizer with a learning rate of 1e-3 and a batch size of 128. Early stopping is used with patience of 10 epochs. XGBoost models are trained with 1000 estimators, a learning rate of 0.1, and early stopping patience of 10 rounds. (An illustrative configuration sketch appears below the table.) |
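
On the Pseudocode row: the paper's Algorithm 1 computes a conformal predictive distribution (CPD). The sketch below is a minimal split-conformal version, assuming plain residuals on a held-out calibration set as conformity scores; the function name `split_cpd` and this exact scoring choice are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def split_cpd(predict, X_cal, y_cal, x_new, y_grid):
    """Minimal split-conformal predictive distribution (CPD) sketch.

    Conformity scores are plain residuals on a held-out calibration set
    (an assumption; Algorithm 1 in the paper may differ in detail).
    Returns Q(y | x_new) on y_grid: the fraction of shifted calibration
    predictions falling at or below each candidate label y.
    """
    residuals = y_cal - predict(X_cal)      # calibration residuals
    point = predict(x_new[None, :])[0]      # point prediction at x_new
    samples = point + residuals             # implied predictive sample
    return np.array([(np.sum(samples <= y) + 1.0) / (len(samples) + 1.0)
                     for y in y_grid])
```

Evaluating this on a grid of candidate labels yields an empirical predictive CDF at `x_new`, the kind of object from which per-example characterization scores can then be derived.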
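
On the Dataset Splits row: a minimal sketch of the reported 80:10:10 protocol repeated over five seeds. The use of scikit-learn's `train_test_split` and the stand-in data are assumptions; the paper does not name its splitting utility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 8), np.random.randn(1000)  # stand-in data

def make_splits(X, y, seed):
    # 80:10:10 train/validation/test split, as reported in the paper
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

# Five repeats with different seeds; mean and standard deviation of the
# resulting metrics are reported, per the paper
splits = [make_splits(X, y, seed) for seed in range(5)]
```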
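
On the Experiment Setup row: one way to realize the stated hyperparameters. `MLPRegressor` is a stand-in for the paper's MLP (the training framework is unspecified), and note that `early_stopping_rounds` sits on the `XGBRegressor` constructor in xgboost >= 1.6.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

X_train, y_train = np.random.randn(800, 8), np.random.randn(800)  # stand-in
X_val, y_val = np.random.randn(100, 8), np.random.randn(100)      # stand-in

# 3-layer MLP, 100 units per layer, ReLU, Adam, lr 1e-3, batch size 128,
# 100 epochs, early stopping with patience 10, as reported in the paper
mlp = MLPRegressor(hidden_layer_sizes=(100, 100, 100), activation="relu",
                   solver="adam", learning_rate_init=1e-3, batch_size=128,
                   max_iter=100, early_stopping=True, n_iter_no_change=10)
mlp.fit(X_train, y_train)

# XGBoost: 1000 estimators, learning rate 0.1, early stopping after 10 rounds
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.1,
                   early_stopping_rounds=10)
xgb.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```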