Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Can Diffusion Models Disentangle? A Theoretical Perspective

Authors: Liming Wang, Muhammad Jehanzeb Mirza, Yishu Gong, Yuan Gong, Jiaqi Zhang, Brian Tracey, Katerina Placek, Marco Vilela, Jim Glass

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To validate our theory, we conduct extensive disentanglement experiments on subspace recovery in latent subspace Gaussian mixture models, image colorization, denoising, and voice conversion for speech classification. Our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.
Researcher Affiliation	Collaboration	1Massachusetts Institute of Technology, 2Takeda
Pseudocode	No	The paper describes mathematical formulations and system architectures (e.g., "dual encoder network $s_\theta(x,t) := U s^Z_{\theta_Z}(x,t) + V s^G_{\theta_G}(g(x),t)$" in Section 4.3 and Figure 2), but it does not include a formally structured pseudocode or algorithm block.
Open Source Code	No	Data and code will be released upon acceptance.
Open Datasets	Yes	We begin with synthetic datasets generated from Gaussian mixture models (GMM),... We then move to more realistic settings using standard image datasets. Specifically, we apply our DM-based disentanglement method to two tasks: image colorization on MNIST [71] and image denoising on CIFAR10 [72]. Finally, we validate Theorem 4.1-4.2 on a real-world speech task,... We use speech emotion recognition on the IEMOCAP dataset [76] as a testbed
Dataset Splits	Yes	We use speech emotion recognition on the IEMOCAP dataset [76] as a testbed, where generalization across unseen speakers serves as a proxy for successful disentanglement. This task instantiates multi-view disentanglement (Definition 3.4): different speakers provide style views, and the shared emotion across recordings is the content. ... For the IEMOCAP dataset, we use a system available on SpeechBrain [135] that finetunes on the wav2vec 2.0 backbone [136] with a multi-layer perceptron classifier (MLP) [137]. The classifier is trained using Adam optimizer for 30 epochs with a batch size of 4 and a learning rate of 10^-4 for the MLP and the 10^-5 learning rate for wav2vec 2.0 weights. The system is then evaluated using the standard classification accuracy metric and 5-fold cross validation [76, 138]. For each fold, we use all 8 speakers from the training set as target speakers. On the ALS and ADReSS, we use whisper-medium [129] features, as they have shown to be the most effective for speech impairment classification [139]. To avoid unfair comparison, We concatenate hidden representations over all layers of the whisper-medium encoder rather than selecting a particular layer and perform mean pooling over the frame-level features. For both datasets, we follow the standard splits used in previous works [79] to have no overlaps between speaker in the training and test sets.
Hardware Specification	Yes	All models are implemented in Pytorch [134] on two A5000 GPUs.
Software Dependencies	Yes	All models are implemented in Pytorch [134] on two A5000 GPUs. The training time is approximately an hour for both datasets and the inference is approximately 10 seconds for 64 samples. ... For the IEMOCAP dataset, we use a system available on SpeechBrain [135] that finetunes on the wav2vec 2.0 backbone [136] with a multi-layer perceptron classifier (MLP) [137]. ... we use whisper-medium [129] features
Experiment Setup	Yes	Implementation details. For the GMM dataset, we use a two-layer ReLU network consistent with Theorem 4.5. ... We train the models for 10,000 steps with an Adam [131] optimizer with learning rate 10^-5 and batch size equal to the entire training set. ... For MNIST and CIFAR, we use a U-Net [73] architecture, following common DM design [74]. ... For both datasets, we train the DM using an Adam optimizer [131] with a batch size of 128 and a learning rate 10^-4 for 50 epochs. A VE schedule is used during conditional score matching. During inference, we use probability flow [74] with 500 steps to perform sampling.
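
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration record, which may help anyone attempting a reproduction before the authors' code is released. This is a minimal sketch, not the authors' implementation: the class name, field names, and helper function are hypothetical, the learning rates are interpreted as 1e-5 and 1e-4, and attaching the VE schedule and probability-flow sampler to the image experiments is an assumption based on where those sentences appear in the quoted text.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported settings (names are illustrative)."""
    model: str
    optimizer: str
    learning_rate: float
    batch_size: Optional[int]          # None means one full-batch update per pass
    steps: Optional[int] = None
    epochs: Optional[int] = None
    noise_schedule: Optional[str] = None
    sampler: Optional[str] = None
    sampling_steps: Optional[int] = None


# GMM experiments: two-layer ReLU network, 10,000 Adam steps, full-batch training.
GMM = TrainConfig(
    model="two-layer ReLU network",
    optimizer="Adam",
    learning_rate=1e-5,
    batch_size=None,
    steps=10_000,
)

# MNIST / CIFAR10 experiments: U-Net, Adam, batch size 128, 50 epochs.
# Assumption: the VE schedule and 500-step probability flow apply to these runs.
IMAGE = TrainConfig(
    model="U-Net",
    optimizer="Adam",
    learning_rate=1e-4,
    batch_size=128,
    epochs=50,
    noise_schedule="VE",
    sampler="probability flow",
    sampling_steps=500,
)


def updates_per_epoch(cfg: TrainConfig, n_train: int) -> int:
    """Optimizer updates needed for one pass over n_train examples."""
    if cfg.batch_size is None:
        return 1
    return math.ceil(n_train / cfg.batch_size)
```

For a hypothetical training set of 1,280 examples, `updates_per_epoch(IMAGE, 1280)` gives 10 updates per epoch, while the full-batch GMM setting always gives 1. Details the table does not report, such as random seeds and library versions, would still need to be chosen by whoever reruns the experiments.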