Missing Data Imputation using Optimal Transport

Authors: Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
Researcher Affiliation | Collaboration | 1 CREST-ENSAE, IP Paris, Palaiseau, France; 2 XPOP, INRIA Saclay, France; 3 CMAP, UMR7641, École Polytechnique, IP Paris, Palaiseau, France; 4 LPSM, Sorbonne Université, ENS Paris, France; 5 Google Brain, Paris, France.
Pseudocode | Yes | Algorithm 1 Batch Sinkhorn Imputation [...] Algorithm 2 Meta Sinkhorn Imputation [...] Algorithm 3 Round-Robin Sinkhorn Imputation (a sketch of the batch Sinkhorn objective follows the table).
Open Source Code | Yes | The code to reproduce these experiments is available at https://github.com/BorisMuzellec/MissingDataOT.
Open Datasets | Yes | We evaluate each method on 23 datasets from the UCI machine learning repository (see Table 1) with varying proportions of missing data and different missing data mechanisms. These datasets only contain quantitative features. (A sketch of MCAR masking follows the table.)
Dataset Splits | No | For the main experiments, the paper does not specify a separate validation split or how one would be used. For the out-of-sample (OOS) experiment, it states: 'we randomly sample 70% of the data to be used for training, and the remaining 30% to evaluate OOS imputation.' This is a train/test split, not an explicit validation split under the definition criteria.
Hardware Specification | No | The paper states, 'GPUs are used for Sinkhorn and deep learning methods.' However, this is a general statement and does not provide specific details such as GPU models, CPU specifications, or memory, which are required for reproducibility.
Software Dependencies | No | The paper names 'scikit-learn', 'mice', 'Adam', and 'RMSprop', but gives no version numbers for any of these, which is necessary for reproducible software dependency information.
Experiment Setup | Yes | If the dataset has more than 256 points, the batch size is fixed to 128, otherwise to 2⌊n/2⌋ where n is the size of the dataset. The noise parameter in Algorithm 1 is fixed to 0.1. For Sinkhorn round-robin models (Linear RR and MLP RR), the maximum number of cycles is 10, 10 pairs of batches are sampled per gradient update, and an ℓ2 weight regularization of magnitude 10^-5 is applied during training. For all 3 Sinkhorn-based methods, we use gradient methods with adaptive step sizes as per Algorithms 1 and 3, with an initial step size fixed to 10^-2.
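
To make the quoted setup concrete, below is a minimal sketch of batch Sinkhorn imputation in the spirit of Algorithm 1, written with PyTorch and GeomLoss. The batch size 128, noise 0.1, and initial step size 10^-2 come from the setup row above; the function name `sinkhorn_impute`, the `blur` value, the number of iterations, and the mean-plus-noise initialization are illustrative assumptions, not the authors' exact implementation (which is in the linked repository).

```python
import torch
from geomloss import SamplesLoss  # Sinkhorn divergence between sampled batches


def sinkhorn_impute(X, mask, n_iter=2000, batch_size=128, lr=1e-2, noise=0.1, blur=0.05):
    """Minimal sketch of batch Sinkhorn imputation.

    X    : (n, d) float tensor; values at missing entries are ignored
    mask : (n, d) boolean tensor, True where an entry is missing
    """
    n, d = X.shape
    X = X.detach().clone()

    # Initialise missing entries at the observed column means plus small noise,
    # and treat them as the free parameters of the optimisation.
    col_means = torch.nanmean(torch.where(mask, torch.full_like(X, float("nan")), X), dim=0)
    init = col_means[None, :].expand(n, d)[mask]
    imps = (init + noise * torch.randn_like(init)).requires_grad_()

    sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=blur, backend="tensorized")
    opt = torch.optim.RMSprop([imps], lr=lr)  # adaptive step sizes, initial step size 1e-2

    for _ in range(n_iter):
        X_filled = X.clone()
        X_filled[mask] = imps
        # Draw two random batches and pull their distributions together in Sinkhorn divergence.
        idx1 = torch.randperm(n)[:batch_size]
        idx2 = torch.randperm(n)[:batch_size]
        loss = sinkhorn(X_filled[idx1], X_filled[idx2])
        opt.zero_grad()
        loss.backward()
        opt.step()

    X_filled = X.clone()
    X_filled[mask] = imps.detach()
    return X_filled
```

Keeping the imputed entries as the only trainable parameters is what lets a standard adaptive optimizer such as RMSprop play the role of the adaptive-step-size gradient method quoted above; for datasets with 256 or fewer points, the quoted setup uses a smaller batch size than 128.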
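
For the MCAR setting named in the Open Datasets row, missingness is drawn independently of the data. The NumPy sketch below simulates such a mask; the function name `mcar_mask` and the 30% missing rate are illustrative choices, not values taken from the paper.

```python
import numpy as np


def mcar_mask(X, p_miss=0.3, seed=None):
    """Drop each entry independently with probability p_miss (MCAR):
    the missingness pattern does not depend on the data values."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < p_miss  # True where the entry is removed
    X_nan = X.astype(float)
    X_nan[mask] = np.nan
    return X_nan, mask


# Toy usage: corrupt a 5-point, 3-feature dataset at a 30% missing rate.
X = np.arange(15, dtype=float).reshape(5, 3)
X_nan, mask = mcar_mask(X, p_miss=0.3, seed=0)
```

MAR and MNAR mechanisms differ in that the drop probability depends on observed values, or on the (possibly unobserved) values themselves, respectively.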