Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Benchmarks, Algorithms, and Metrics for Hierarchical Disentanglement
Authors: Andrew Ross, Finale Doshi-Velez
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we develop benchmarks, algorithms, and metrics for learning such hierarchical representations. We ran experiments on nine benchmark datasets: Spaceshapes, and eight variants of Chopsticks... Results across metrics are shown for a subset of datasets and models in Fig. 6. |
| Researcher Affiliation | Academia | Andrew Slavin Ross¹, Finale Doshi-Velez¹; ¹Harvard University, Cambridge, MA, USA. |
| Pseudocode | Yes | Algorithm 1 MIMOSA(X); Algorithm 2 COFHAE(X) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Our first benchmark dataset is Spaceshapes, a binary 64x64 image dataset meant to hierarchically extend dSprites, a shape dataset common in the disentanglement literature (Matthey et al., 2017). |
| Dataset Splits | No | The paper does not specify exact split percentages or sample counts for the training, validation, and test sets. It describes the hyperparameter tuning process but not the data partitioning. |
| Hardware Specification | No | The paper does not specify any particular hardware used for experiments (e.g., GPU models, CPU models, or cloud computing instances with detailed specs). |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | Over a grid of τ in {1/3, 1}, λ1 in {10, 100, 1000}, and λ2 in {1, 10, 100}, we select the model with the lowest training reconstruction loss Lₓ from the 1/3 with the lowest assignment loss Lₐ. |
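The model-selection rule quoted in the Experiment Setup row is a two-stage filter over an 18-point grid (2 × 3 × 3): keep the third of configurations with the lowest assignment loss, then pick the one with the lowest training reconstruction loss among them. The following is a minimal Python sketch of that rule only, assuming the per-configuration losses have already been computed elsewhere. The names `select_model` and `fake_losses` are hypothetical, the losses are made-up placeholder numbers rather than results from the paper, and rounding the one-third cutoff up is our assumption (with 18 configurations the cutoff is exactly 6 either way).

```python
from itertools import product
from math import ceil
from typing import Dict, Tuple

# Hyperparameter grid as quoted in the setup description.
TAUS = [1 / 3, 1.0]
LAMBDA1S = [10, 100, 1000]
LAMBDA2S = [1, 10, 100]

Config = Tuple[float, int, int]  # (tau, lambda1, lambda2)


def select_model(results: Dict[Config, Tuple[float, float]]) -> Config:
    """Return the config with the lowest reconstruction loss among the
    third of configs with the lowest assignment loss.

    `results` maps (tau, lambda1, lambda2) -> (assignment_loss, reconstruction_loss).
    """
    # Stage 1: rank configs by assignment loss and keep the lowest third.
    ranked = sorted(results, key=lambda cfg: results[cfg][0])
    keep = max(1, ceil(len(ranked) / 3))  # assumption: round the cutoff up
    shortlist = ranked[:keep]
    # Stage 2: among the shortlist, take the lowest reconstruction loss.
    return min(shortlist, key=lambda cfg: results[cfg][1])


def fake_losses(tau: float, lam1: int, lam2: int) -> Tuple[float, float]:
    # Hypothetical stand-in for training and evaluating a model; NOT real results.
    assignment = 1.0 / lam1 + tau
    reconstruction = lam1 / 100 + lam2 / 10 + tau
    return assignment, reconstruction


grid = list(product(TAUS, LAMBDA1S, LAMBDA2S))
results = {cfg: fake_losses(*cfg) for cfg in grid}
best = select_model(results)
print(best)
```

With these placeholder losses, the six configurations with τ = 1/3 and λ1 in {100, 1000} form the shortlist, and the reconstruction stage then selects λ1 = 100, λ2 = 1. Filtering on assignment loss first means a configuration with excellent reconstruction but poor assignment can never be chosen, which matches the ordering described in the quoted setup.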