Marginal Distribution Adaptation for Discrete Sets via Module-Oriented Divergence Minimization
Authors: Hanjun Dai, Mengjiao Yang, Yuan Xue, Dale Schuurmans, Bo Dai
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will first validate the correctness of our marginal adaptation framework on synthetic datasets for all the three models in Section 5.1. Then in Section 5.2 we study the effectiveness of the framework in adapting the learned distribution to the target distribution via marginal alignment using real-world datasets. We present the experiment configurations for model architectures, training and evaluation methods used in both sections. |
| Researcher Affiliation | Industry | 1Google Research, Brain Team 2Google Cloud. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing its source code for the methodology described, nor does it include any links to a code repository. |
| Open Datasets | Yes | MIMIC3: We curate this dataset based on the encounter ICD9 diagnosis codes from MIMIC-III (Johnson et al., 2016), an open source EHR dataset. Instacart (ins): This dataset comes from the Kaggle Instacart Market Basket Analysis competition. We select the top 1,000 popular products for generation and control experiments. |
| Dataset Splits | Yes | Without timing information, we randomly split the Groceries dataset into Dsrc and Dtgt with ratio 9:1. For Instacart, we use its own prior set as Dsrc and train as Dtgt. For all the others with timing information, we sort the datasets according to the timestamp and then use the first 90% as Dsrc and rest 10% as Dtgt. |
| Hardware Specification | Yes | By default we train all the base models p and adapted models q on a single Nvidia V100 GPU with batch size 128, using Adam optimizer. |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer", "PCD framework", and "GWG-sampler", but it does not specify version numbers for these software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Model configuration: We present the default model configurations here unless later specified. LVM: ... MLP with 2 hidden layers of 512 ReLU-activated neurons... Autoregressive: We use Transformers... 4 layers with 8 heads... dimensions for embedding and feed-forward layers are 256 and 512... EBM: We use an MLP with 2 hidden layers of 512 ReLU-activated neurons for f used in p. Training configuration: By default we train all the base models p and adapted models q on a single Nvidia V100 GPU with batch size 128, using Adam optimizer. For EBMs training we leverage the PCD framework... We use GWG-sampler... The number of MCMC steps per gradient update varies within {50, 100, 200}. (Hedged sketches of the dataset splits and these default configurations follow the table.) |
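
The split protocol quoted in the Dataset Splits row is simple to restate in code. The sketch below is an illustration only: it assumes each record is a `(timestamp, item_set)` tuple, and the record format, function names, and random seed are hypothetical rather than taken from the paper. (Instacart is the exception, since it reuses its predefined prior/train sets as Dsrc/Dtgt.)

```python
# A minimal sketch of the source/target splits described in the table, assuming each
# record is a (timestamp, item_set) tuple; the record format and function names are
# hypothetical, not taken from the paper.
import random

def temporal_split(records, src_ratio=0.9):
    """Datasets with timestamps: sort by time, first 90% -> D_src, last 10% -> D_tgt."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * src_ratio)
    return ordered[:cut], ordered[cut:]

def random_split(records, src_ratio=0.9, seed=0):
    """Datasets without timing information (e.g. Groceries): random 9:1 split."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * src_ratio)
    return shuffled[:cut], shuffled[cut:]
```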
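
Similarly, the default architectures and training settings quoted in the Experiment Setup row can be sketched as follows. The paper does not name its deep-learning framework, so PyTorch is an assumption here; `vocab_size`, the input encoding for the energy function, and the learning rate are placeholders rather than reported values.

```python
# A hedged sketch of the default configurations quoted above. PyTorch is assumed;
# vocab_size, the EBM input encoding, and the learning rate are placeholders.
import torch
import torch.nn as nn

vocab_size = 1000  # e.g. the top-1,000 Instacart products selected in the paper

# EBM energy function f: an MLP with 2 hidden layers of 512 ReLU-activated units,
# here applied to a binary set-indicator vector (the input encoding is an assumption).
energy_mlp = nn.Sequential(
    nn.Linear(vocab_size, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 1),
)

# Autoregressive backbone: a 4-layer, 8-head Transformer with embedding dim 256
# and feed-forward dim 512, matching the quoted defaults.
embedding = nn.Embedding(vocab_size, 256)
transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8,
                               dim_feedforward=512, batch_first=True),
    num_layers=4,
)

# Training defaults quoted in the table: batch size 128 and the Adam optimizer
# (one optimizer per model; the learning rate below is only a placeholder).
batch_size = 128
optimizer = torch.optim.Adam(energy_mlp.parameters(), lr=1e-3)
mcmc_steps = 100  # varied within {50, 100, 200} per gradient update for PCD/GWG EBM training
```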