Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome
Authors: Elliott Gordon-Rodriguez, Thomas Quinn, John P. Cunningham
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. We evaluate our augmentation strategies on 12 standard binary classification tasks taken from the Microbiome Learning Repo [60]. |
| Researcher Affiliation | Academia | Elliott Gordon-Rodriguez, Columbia University, eg2912@columbia.edu; Thomas P. Quinn, Independent Scientist, contacttomquinn@gmail.com; John P. Cunningham, Columbia University, jpc2181@columbia.edu |
| Pseudocode | No | The paper describes its algorithms using numbered steps and mathematical formulations within the main text (e.g., Section 3.1, 3.2, 3.3), but these are not presented in a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code is available at https://github.com/cunningham-lab/AugCoDa. |
| Open Datasets | Yes | We evaluate our augmentation strategies on 12 standard binary classification tasks taken from the Microbiome Learning Repo [60]. More details on these 12 learning tasks can be found in Table 1; note this benchmark is constructed from the original Microbiome Learning Repo by filtering datasets that contain a minimum sample size of 100, with at least 50 in either class. Table 1: Evaluation benchmark consisting of 12 binary classification tasks taken from the Microbiome Learning Repo [60], after filtering to datasets containing at least 100 samples with at least 50 in each class. For each task we show the number of samples (n), the number of features (p), a description of the two classes and the number of samples in each, together with a reference to the original studies that each dataset was obtained from. |
| Dataset Splits | No | For each learning task, we take 20 independent 80/20 train/test splits and we fit Random Forest, XGBoost, mAML [63], DeepCoDa [48], and MetaNN [37], first to the original training data, then on 3 augmented training sets obtained using our 3 augmentation strategies. The paper does not explicitly state a separate validation split for hyperparameter tuning or early stopping within the main text for general reproducibility. |
| Hardware Specification | No | We train these models in parallel on a CPU cluster. No specific hardware details like CPU model, memory, or GPU types are provided. |
| Software Dependencies | No | The paper mentions using various models like Random Forest, XGBoost, mAML, DeepCoDa, and MetaNN, but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, scikit-learn versions). |
| Experiment Setup | Yes | For the augmented training sets, we generated 10 times as many synthetic samples as there were original training examples. ... by downweighting the synthetic samples by a factor of 10; the total weight of the original and synthetic data is then equal to 1/2 each. ... whenever a random mixing parameter is required, we use a U(0, 1) distribution... Full detail on model architecture and implementation is provided in Appendix B. |
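The U(0, 1) mixing parameter quoted in the Experiment Setup row can be illustrated with a minimal mixup-style sketch for compositional (relative-abundance) data. The function name and the convex-combination form here are illustrative assumptions, not the paper's exact augmentation strategies (see Section 3 and Appendix B of the paper for those):

```python
import numpy as np

def mixup_compositional(X, y, n_synthetic, rng=None):
    """Generate synthetic samples as convex combinations of random row pairs.

    Each synthetic sample mixes two training rows with a weight drawn from
    U(0, 1). Rows that sum to 1 (compositions) stay on the simplex, because
    a convex combination of simplex points remains on the simplex.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    i = rng.integers(0, n, size=n_synthetic)
    j = rng.integers(0, n, size=n_synthetic)
    lam = rng.uniform(0.0, 1.0, size=(n_synthetic, 1))
    X_syn = lam * X[i] + (1.0 - lam) * X[j]
    # For binary labels, one simple choice (an illustrative assumption) is
    # to inherit the label of the dominant mixture component.
    y_syn = np.where(lam[:, 0] >= 0.5, y[i], y[j])
    return X_syn, y_syn
```

Because the mixing is a convex combination, no renormalization step is needed to keep synthetic samples compositional.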
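The evaluation protocol described above (20 independent 80/20 train/test splits; 10x synthetic samples downweighted by a factor of 10 so that original and synthetic data each carry half the total weight) can be sketched as follows. This is a minimal sketch assuming scikit-learn, with Random Forest standing in for the paper's full model suite; the `augment` callable is a placeholder for any of the three augmentation strategies:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def augmented_weights(n_train, factor=10):
    """Sample weights giving original and synthetic data equal total mass.

    With factor * n_train synthetic samples, each downweighted by `factor`,
    both groups contribute total weight n_train, i.e. half the
    (unnormalized) mass each, matching the weighting described in the paper.
    """
    w_orig = np.ones(n_train)
    w_syn = np.full(factor * n_train, 1.0 / factor)
    return np.concatenate([w_orig, w_syn])

def evaluate_split(X, y, augment, seed, factor=10):
    """One of the 20 independent 80/20 train/test evaluations (sketch).

    `augment(X_tr, y_tr, n_synthetic, seed)` is an assumed interface
    returning (X_syn, y_syn); stratification by class is an assumption,
    not stated in the paper's main text.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    X_syn, y_syn = augment(X_tr, y_tr, factor * len(X_tr), seed)
    X_all = np.vstack([X_tr, X_syn])
    y_all = np.concatenate([y_tr, y_syn])
    w = augmented_weights(len(X_tr), factor)
    clf = RandomForestClassifier(random_state=seed)
    clf.fit(X_all, y_all, sample_weight=w)
    return clf.score(X_te, y_te)
```

Running `evaluate_split` for 20 seeds and averaging the scores reproduces the shape of the protocol, though the paper's exact models and tuning live in its Appendix B.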