Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

FACE: Faithful Automatic Concept Extraction

Authors: Dipkamal Bhusal, Michael Clifford, Sara Rampazzi, Nidhi Rastogi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.
Researcher Affiliation	Collaboration	Dipkamal Bhusal (Rochester Institute of Technology, Rochester, NY); Michael Clifford (Toyota InfoTech Labs, Mountain View, CA); Sara Rampazzi (University of Florida, Gainesville, FL); Nidhi Rastogi (Rochester Institute of Technology, Rochester, NY)
Pseudocode	No	The paper describes methods and optimization steps but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	Our code is available at https://github.com/dipkamal/FACE.
Open Datasets	Yes	Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics. Datasets and Models. We evaluate FACE on three datasets of varying semantic granularity: ImageNet [7], COCO [19], and CelebA [20]. We use ResNet-34 [14] and MobileNetV2 [28] as target models for explanation.
Dataset Splits	Yes	All results are averaged over 10,000 correctly classified samples from 10 different ImageNet classes, 5,000 samples from 5 COCO classes, and 4,000 samples from the 4 selected CelebA attributes.
Hardware Specification	Yes	We measured wall-clock time and peak VRAM on a single NVIDIA TITAN Xp (12 GB VRAM, CUDA 12.2) using ResNet-34, rank r = 25, and 1500 ImageNet images (classwise run).
Software Dependencies	No	The paper mentions 'NVIDIA TITAN Xp (12 GB VRAM, CUDA 12.2)' but does not list specific software dependencies like programming languages or libraries with version numbers, e.g., 'Python 3.8, PyTorch 1.9'.
Experiment Setup	Yes	We optimize this using Adam with a learning rate of $5 \times 10^{-4}$ and early stopping when the absolute change in total loss drops below a threshold of $10^{-3}$. Non-negativity is enforced on U and W after each gradient update via in-place clamping. We sweep over $\{10^{-25}, \ldots, 10^{20}\}$ to select the best regularization value per dataset. We use a matrix decomposition rank of 25 for experiments and provide an ablation study on varying the decomposition rank hyperparameter in Section 4.4.
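
The optimization loop quoted in the Experiment Setup row (Adam, early stopping on the absolute change in loss, in-place clamping of U and W) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain squared reconstruction loss instead of the paper's full objective, omits the regularization term, and hand-rolls Adam in pure Python so that it runs without dependencies. The names `nonneg_factorize`, `matmul`, `transpose`, `tol`, and `max_steps`, the Adam constants, and the toy matrix `A` are all assumptions introduced for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def nonneg_factorize(a, rank=25, lr=5e-4, tol=1e-3, max_steps=5000, seed=0):
    """Approximate a (n x d) by U (n x rank) times W (rank x d) with U, W >= 0.

    Hypothetical helper: squared reconstruction loss only, no regularizer.
    """
    rng = random.Random(seed)
    n, d = len(a), len(a[0])
    params = {
        "U": [[rng.random() for _ in range(rank)] for _ in range(n)],
        "W": [[rng.random() for _ in range(d)] for _ in range(rank)],
    }
    # Adam first- and second-moment estimates, one per parameter entry.
    m1 = {k: [[0.0] * len(p[0]) for _ in p] for k, p in params.items()}
    m2 = {k: [[0.0] * len(p[0]) for _ in p] for k, p in params.items()}
    beta1, beta2, eps = 0.9, 0.999, 1e-8  # common Adam defaults (assumed)
    history = []
    for step in range(1, max_steps + 1):
        U, W = params["U"], params["W"]
        resid = [[r - x for r, x in zip(rrow, arow)]
                 for rrow, arow in zip(matmul(U, W), a)]
        history.append(sum(e * e for row in resid for e in row))
        # Early stopping: absolute change in loss falls below the threshold.
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
        # Gradients of ||UW - a||^2 with respect to U and W.
        grads = {
            "U": [[2.0 * g for g in row] for row in matmul(resid, transpose(W))],
            "W": [[2.0 * g for g in row] for row in matmul(transpose(U), resid)],
        }
        for k, p in params.items():
            for i, row in enumerate(p):
                for j, g in enumerate(grads[k][i]):
                    m1[k][i][j] = beta1 * m1[k][i][j] + (1 - beta1) * g
                    m2[k][i][j] = beta2 * m2[k][i][j] + (1 - beta2) * g * g
                    m_hat = m1[k][i][j] / (1 - beta1 ** step)
                    v_hat = m2[k][i][j] / (1 - beta2 ** step)
                    row[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
                    # Non-negativity: clamp in place after the gradient update.
                    row[j] = max(row[j], 0.0)
    return params["U"], params["W"], history


# Toy usage: a small non-negative matrix and rank 2 (the paper reports rank 25).
A = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
]
U, W, history = nonneg_factorize(A, rank=2)
```

With a learning rate this small, the absolute-change criterion can fire well before the factorization has converged on toy data, so `history` is best read as a trace of the stopping rule rather than as evidence of a good fit.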