Missing Data Imputation using Optimal Transport

Authors: Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
Researcher Affiliation | Collaboration | 1 CREST-ENSAE, IP Paris, Palaiseau, France; 2 XPOP, INRIA Saclay, France; 3 CMAP, UMR7641, École Polytechnique, IP Paris, Palaiseau, France; 4 LPSM, Sorbonne Université, ENS Paris, France; 5 Google Brain, Paris, France.
Pseudocode | Yes | Algorithm 1 Batch Sinkhorn Imputation [...] Algorithm 2 Meta Sinkhorn Imputation [...] Algorithm 3 Round-Robin Sinkhorn Imputation (a sketch of the batch Sinkhorn objective follows the table).
Open Source Code | Yes | The code to reproduce these experiments is available at https://github.com/BorisMuzellec/MissingDataOT.
Open Datasets | Yes | We evaluate each method on 23 datasets from the UCI machine learning repository (see Table 1) with varying proportions of missing data and different missing data mechanisms. These datasets only contain quantitative features. (A sketch of MCAR masking follows the table.)
Dataset Splits | No | For the main experiments, the paper does not specify a separate validation split or how one would be used. For the out-of-sample (OOS) experiment, it states: 'we randomly sample 70% of the data to be used for training, and the remaining 30% to evaluate OOS imputation.' This is a train/test split, not an explicit validation split under the definition criteria.
Hardware Specification | No | The paper states, 'GPUs are used for Sinkhorn and deep learning methods.' However, this is a general statement and does not provide specific details such as GPU models, CPU specifications, or memory, which are required for reproducibility.
Software Dependencies | No | The paper names 'scikit-learn', 'mice', 'Adam', and 'RMSprop', but gives no version numbers for any of these, which is necessary for reproducible software dependency information.
Experiment Setup | Yes | If the dataset has more than 256 points, the batch size is fixed to 128, otherwise to 2⌊n/2⌋ where n is the size of the dataset. The noise parameter in Algorithm 1 is fixed to 0.1. For Sinkhorn round-robin models (Linear RR and MLP RR), the maximum number of cycles is 10, 10 pairs of batches are sampled per gradient update, and an ℓ2 weight regularization of magnitude 10^-5 is applied during training. For all 3 Sinkhorn-based methods, we use gradient methods with adaptive step sizes as per Algorithms 1 and 3, with an initial step size fixed to 10^-2.
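
To make the quoted setup concrete, below is a minimal sketch of batch Sinkhorn imputation in the spirit of Algorithm 1, written with PyTorch and GeomLoss. The batch size 128, noise 0.1, and initial step size 10^-2 come from the setup row above; the function name `sinkhorn_impute`, the `blur` value, the number of iterations, and the mean-plus-noise initialization are illustrative assumptions, not the authors' exact implementation (which is in the linked repository).

```python
import torch
from geomloss import SamplesLoss  # Sinkhorn divergence between sampled batches


def sinkhorn_impute(X, mask, n_iter=2000, batch_size=128, lr=1e-2, noise=0.1, blur=0.05):
    """Minimal sketch of batch Sinkhorn imputation.

    X    : (n, d) float tensor; values at missing entries are ignored
    mask : (n, d) boolean tensor, True where an entry is missing
    """
    n, d = X.shape
    X = X.detach().clone()

    # Initialise missing entries at the observed column means plus small noise,
    # and treat them as the free parameters of the optimisation.
    col_means = torch.nanmean(torch.where(mask, torch.full_like(X, float("nan")), X), dim=0)
    init = col_means[None, :].expand(n, d)[mask]
    imps = (init + noise * torch.randn_like(init)).requires_grad_()

    sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=blur, backend="tensorized")
    opt = torch.optim.RMSprop([imps], lr=lr)  # adaptive step sizes, initial step size 1e-2

    for _ in range(n_iter):
        X_filled = X.clone()
        X_filled[mask] = imps
        # Draw two random batches and pull their distributions together in Sinkhorn divergence.
        idx1 = torch.randperm(n)[:batch_size]
        idx2 = torch.randperm(n)[:batch_size]
        loss = sinkhorn(X_filled[idx1], X_filled[idx2])
        opt.zero_grad()
        loss.backward()
        opt.step()

    X_filled = X.clone()
    X_filled[mask] = imps.detach()
    return X_filled
```

Keeping the imputed entries as the only trainable parameters is what lets a standard adaptive optimizer such as RMSprop play the role of the adaptive-step-size gradient method quoted above; for datasets with 256 or fewer points, the quoted setup uses a smaller batch size than 128.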
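
For the MCAR setting named in the Open Datasets row, missingness is drawn independently of the data. The NumPy sketch below simulates such a mask; the function name `mcar_mask` and the 30% missing rate are illustrative choices, not values taken from the paper.

```python
import numpy as np


def mcar_mask(X, p_miss=0.3, seed=None):
    """Drop each entry independently with probability p_miss (MCAR):
    the missingness pattern does not depend on the data values."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < p_miss  # True where the entry is removed
    X_nan = X.astype(float)
    X_nan[mask] = np.nan
    return X_nan, mask


# Toy usage: corrupt a 5-point, 3-feature dataset at a 30% missing rate.
X = np.arange(15, dtype=float).reshape(5, 3)
X_nan, mask = mcar_mask(X, p_miss=0.3, seed=0)
```

MAR and MNAR mechanisms differ in that the drop probability depends on observed values, or on the (possibly unobserved) values themselves, respectively.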