Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Correcting Flaws in Common Disentanglement Metrics

Authors: Louis Mahon, Lei Sha, Thomas Lukasiewicz

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure the performance of six existing disentanglement models on this downstream compositional generalization task, and show that performance is (a) generally quite poor, (b) correlated, to varying degrees, with most disentanglement metrics, and (c) most strongly correlated with our newly proposed metrics. The code for our metrics is available at https://github.com/Lou1sM/snc_nk.
Researcher Affiliation | Academia | Louis Mahon (EMAIL), School of Informatics, University of Edinburgh; Lei Sha, Artificial Intelligence Institute, Beihang University; Thomas Lukasiewicz, Institute of Logic and Computation, Vienna University of Technology
Pseudocode | No | The paper describes methods and mathematical proofs but does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The code for our metrics is available at https://github.com/Lou1sM/snc_nk. Most importantly, we release all the code to reproduce our experiments in an anonymous repo https://github.com/anon296/anon.
Open Datasets | Yes | Dsprites contains 737,280 black-and-white images with features (x, y)-coordinates, size, orientation and shape. 3dshapes contains 480,000 images with features object/ground/wall colour, size, camera azimuth, and shape. MPI3D contains 103,680 images of objects at the end of a robot arm with features object colour, size and shape, camera height and azimuth, and altitude of the robot arm.
Dataset Splits | Yes | That is, we test whether the representations produced by a model can be used to correctly classify novel combinations of familiar features. Following Xu et al. (2022), we 1. randomly sample values for two features, e.g., shape and size, 2. form a test set of points with those two values for those two features, e.g., all points with size=0 and shape=square, and a train set of all other points, 3. train the VAE (or supervised model) on the train set, 4. encode all data with the VAE encoder, 5. train and test an MLP with one output per feature, to predict the feature values from the encodings. The normal test-set setting uses the same method except that it divides the train and test sets randomly.
Hardware Specification | Yes | All experiments were performed on a single Tesla V100 GPU on an internal compute cluster.
Software Dependencies | No | The paper mentions training MLPs and linear classification heads, and using Adam as an optimizer, but does not provide specific version numbers for software libraries, programming languages, or other key software components used in the experiments.
Experiment Setup | Yes | The MLPs and linear classification heads are trained using Adam, learning rate 0.001, β1=0.9, β2=0.999, for 75 epochs. The MLP has one hidden layer of size 256. MTD is trained using the authors' code (obtained privately) with all default parameters, for 10 epochs. Other models are trained using the library at https://github.com/YannDubs/disentangling-vae, using all default parameters.
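The compositional-generalization split quoted under "Dataset Splits" can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the toy feature grid are hypothetical, and only the probe hyperparameters in the trailing config dict are values reported in the table above.

```python
import numpy as np

def compositional_split(features, f1, f2, rng):
    """Hold out all points whose features f1 and f2 take one randomly
    sampled pair of values (steps 1-2 of the quoted procedure);
    train on every other point."""
    v1 = rng.choice(np.unique(features[:, f1]))
    v2 = rng.choice(np.unique(features[:, f2]))
    test_mask = (features[:, f1] == v1) & (features[:, f2] == v2)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy stand-in grid: 3 shapes x 6 sizes, one point per combination
# (real datasets like dSprites have many points per combination).
feats = np.array([(shape, size) for shape in range(3) for size in range(6)])
rng = np.random.default_rng(0)
train_idx, test_idx = compositional_split(feats, f1=0, f2=1, rng=rng)

# Probe hyperparameters as reported in the table: Adam with
# lr 0.001, betas (0.9, 0.999), 75 epochs, one hidden layer of 256.
probe_config = {"optimizer": "adam", "lr": 1e-3, "betas": (0.9, 0.999),
                "epochs": 75, "hidden_sizes": (256,)}
```

On the toy grid each (shape, size) pair occurs exactly once, so the held-out set contains one point; on the real datasets it contains every image matching the sampled value pair.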