Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome

Authors: Elliott Gordon-Rodriguez, Thomas Quinn, John P. Cunningham

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. We evaluate our augmentation strategies on 12 standard binary classification tasks taken from the Microbiome Learning Repo [60].
Researcher Affiliation | Academia | Elliott Gordon-Rodriguez, Columbia University, eg2912@columbia.edu; Thomas P. Quinn, Independent Scientist, contacttomquinn@gmail.com; John P. Cunningham, Columbia University, jpc2181@columbia.edu
Pseudocode | No | The paper describes its algorithms using numbered steps and mathematical formulations in the main text (e.g., Sections 3.1, 3.2, and 3.3), but does not present them in a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code is available at https://github.com/cunningham-lab/AugCoDa.
Open Datasets | Yes | We evaluate our augmentation strategies on 12 standard binary classification tasks taken from the Microbiome Learning Repo [60]. Details of the 12 learning tasks are given in Table 1 of the paper; the benchmark is built from the original Microbiome Learning Repo by keeping only datasets with at least 100 samples and at least 50 samples in each class. For each task, Table 1 reports the number of samples (n), the number of features (p), a description of the two classes with their sample counts, and a reference to the original study from which each dataset was obtained. (A sketch of the filtering rule follows this table.)
Dataset Splits | No | For each learning task, we take 20 independent 80/20 train/test splits and fit Random Forest, XGBoost, mAML [63], DeepCoDa [48], and MetaNN [37], first to the original training data, then to 3 augmented training sets obtained using our 3 augmentation strategies. The paper does not explicitly describe a separate validation split for hyperparameter tuning or early stopping. (A sketch of this evaluation protocol follows this table.)
Hardware Specification | No | We train these models in parallel on a CPU cluster. No specific hardware details such as CPU model, memory, or GPU type are provided.
Software Dependencies | No | The paper mentions models such as Random Forest, XGBoost, mAML, DeepCoDa, and MetaNN, but does not provide version numbers for any software libraries or dependencies (e.g., Python, PyTorch, or scikit-learn versions).
Experiment Setup | Yes | For the augmented training sets, we generated 10 times as many synthetic samples as there were original training examples. ... by downweighting the synthetic samples by a factor of 10; the total weight of the original and synthetic data is then equal to 1/2 each. ... whenever a random mixing parameter is required, we use a U(0, 1) distribution. Full detail on model architecture and implementation is provided in Appendix B. (A sketch of this weighting scheme follows this table.)
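
The Open Datasets row quotes a filter for building the benchmark: keep a dataset only if it has at least 100 samples and at least 50 in each class. Below is a minimal sketch of that check; the function name and the assumption of integer-coded binary labels are ours, not the paper's.

```python
import numpy as np

def keep_dataset(y, min_samples=100, min_per_class=50):
    """Benchmark filter: >= 100 samples total and >= 50 samples in each class."""
    y = np.asarray(y).astype(int)          # assumes labels coded as 0/1
    counts = np.bincount(y, minlength=2)   # per-class sample counts
    return len(y) >= min_samples and counts.min() >= min_per_class
```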
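The Experiment Setup row pins down three numbers: a 10x synthetic-to-original ratio, a 10x downweighting of synthetic samples (so original and synthetic data each carry half the total weight), and U(0, 1) mixing parameters. The sketch below wires these together around a mixup-style augmentation on the simplex (geometric mixing of two samples, renormalized), which we assume stands in for one of the paper's three strategies; the pseudocount for zero handling and the within-class pairing are our assumptions, and the authors' exact implementation is in the AugCoDa repository and Appendix B.

```python
import numpy as np

def mixup_on_simplex(X, y, n_synthetic, rng, pseudocount=1e-6):
    """Geometric (log-linear) mixture of two same-class compositions, renormalized."""
    X = np.asarray(X, dtype=float) + pseudocount   # assumed handling of zero counts
    X = X / X.sum(axis=1, keepdims=True)           # project rows onto the simplex
    y = np.asarray(y)
    X_syn = np.empty((n_synthetic, X.shape[1]))
    y_syn = np.empty(n_synthetic, dtype=y.dtype)
    classes = np.unique(y)
    for k in range(n_synthetic):
        c = rng.choice(classes)                    # class balance of synthetics: our choice
        i, j = rng.choice(np.flatnonzero(y == c), size=2)
        lam = rng.uniform(0.0, 1.0)                # U(0, 1) mixing parameter, as quoted
        z = X[i] ** lam * X[j] ** (1.0 - lam)      # mix in log space
        X_syn[k] = z / z.sum()                     # renormalize to the simplex
        y_syn[k] = c
    return X_syn, y_syn

def augment(X_tr, y_tr, factor=10, rng=None):
    """Original rows plus factor-x synthetic rows, with synthetics downweighted."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y_tr)
    X_syn, y_syn = mixup_on_simplex(X_tr, y_tr, n_synthetic=factor * n, rng=rng)
    X_all = np.vstack([np.asarray(X_tr, dtype=float), X_syn])
    y_all = np.concatenate([y_tr, y_syn])
    # Weight 1/factor per synthetic row: the original and synthetic portions
    # then each carry half of the total weight, as quoted above.
    w_all = np.concatenate([np.ones(n), np.full(factor * n, 1.0 / factor)])
    return X_all, y_all, w_all
```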
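Finally, the Dataset Splits row implies the outer loop below: 20 independent 80/20 splits, with each model fit first to the original training data and then to augmented data. Random Forest stands in for the five models and AUC for the score; stratified splitting and the choice of metric are our assumptions, not statements from the paper. Any callable returning (features, labels, sample weights), such as the `augment` helper sketched above, can be passed in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(X, y, augment, n_splits=20, seed=0):
    """Mean test AUC over independent 80/20 splits, with and without augmentation."""
    scores = {"original": [], "augmented": []}
    for i in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed + i)
        runs = {
            "original": (X_tr, y_tr, np.ones(len(y_tr))),
            "augmented": augment(X_tr, y_tr),   # (X, y, sample weights), as sketched above
        }
        for name, (Xa, ya, wa) in runs.items():
            clf = RandomForestClassifier(random_state=seed + i)
            clf.fit(Xa, ya, sample_weight=wa)
            scores[name].append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return {name: float(np.mean(s)) for name, s in scores.items()}
```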