Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes
Authors: Connor Toups, Rishi Bommasani, Kathleen A. Creel, Sarah H. Bana, Dan Jurafsky, Percy Liang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across three modalities (text, images, speech) and eleven datasets, we establish a clear trend: deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available. The analysis draws upon a large-scale audit [HAPI; Chen et al., 2022a] covering three commercial systems per modality. (A toy sketch of this systemic-failure computation appears after the table.) |
| Researcher Affiliation | Academia | Connor Toups (Stanford University), Rishi Bommasani (Stanford University), Kathleen A. Creel (Northeastern University), Sarah H. Bana (Chapman University), Dan Jurafsky (Stanford University), Percy Liang (Stanford University) |
| Pseudocode | No | The paper describes its analytical framework and findings but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | All code is available at https://github.com/rishibommasani/EcosystemLevelAnalysis. |
| Open Datasets | Yes | To establish general trends made visible through ecosystem-level analysis, we draw upon a large-scale three-year audit of commercial ML APIs [HAPI; Chen et al., 2022a] to study the behavior of deployed ML systems across three modalities, eleven datasets, and nine commercial systems. We compare outcomes from prominent dermatology models and board-certified dermatologists on the DDI dataset [Daneshjou et al., 2022]. |
| Dataset Splits | No | The paper discusses using 'evaluation datasets' and mentions 'the test set of FER+,' but it does not specify explicit training/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) for reproducing its analysis setup. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to conduct its analysis or computations. |
| Software Dependencies | No | The paper does not list version numbers for any software libraries, frameworks, or programming languages used in its analysis. |
| Experiment Setup | No | The paper describes its analytical setup, such as defining 'potential improvements' and 'improvements' based on changes in model behavior. However, it does not specify typical experimental details like hyperparameters, optimizers, or training configurations, as it is an analysis paper rather than one proposing a new model that requires training. (See the improvement-rate sketch after the table.) |
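
To make the systemic-failure notion from the Research Type row concrete, here is a minimal Python sketch. This is not the authors' released code: the failure matrix is synthetic, and the 15% per-model failure rate is an arbitrary assumption. It compares the observed rate at which all models fail the same instance against the rate expected if model failures were independent; homogeneous outcomes correspond to the observed rate substantially exceeding that baseline.

```python
import numpy as np

# Hypothetical failure matrix: rows are evaluation instances, columns are
# the k commercial models; True means that model misclassifies the instance.
# (Synthetic stand-in -- in the paper this would be derived from HAPI.)
rng = np.random.default_rng(0)
failures = rng.random((1000, 3)) < 0.15  # assumed 15% failure rate per model

# Observed systemic failure rate: fraction of instances every model fails.
observed = failures.all(axis=1).mean()

# Baseline under independent failures: product of per-model failure rates.
per_model = failures.mean(axis=0)
baseline = per_model.prod()

print(f"per-model failure rates:   {per_model.round(3)}")
print(f"observed systemic failure: {observed:.4f}")
print(f"independence baseline:     {baseline:.4f}")
# Homogeneous outcomes show up as observed well above the baseline;
# this synthetic data, being independent by construction, will not.
```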
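Similarly, for the Experiment Setup row, one hedged reading of the paper's 'potential improvements' (instances the old model misclassified) and 'improvements' (previously misclassified instances the updated model gets right) could be computed as follows; the function name, arrays, and values are all hypothetical.

```python
import numpy as np

def improvement_stats(labels, old_preds, new_preds):
    """Count potential improvements and realized improvements across a
    model update (a hypothetical reading of the paper's definitions)."""
    labels, old_preds, new_preds = map(np.asarray, (labels, old_preds, new_preds))
    potential = old_preds != labels               # 'potential improvements'
    improved = potential & (new_preds == labels)  # 'improvements'
    rate = improved.sum() / max(potential.sum(), 1)
    return int(potential.sum()), int(improved.sum()), rate

labels    = np.array([0, 1, 1, 0, 1, 0])
old_preds = np.array([0, 0, 1, 1, 0, 0])  # misclassifies instances 1, 3, 4
new_preds = np.array([0, 1, 1, 1, 0, 0])  # the update fixes instance 1 only
print(improvement_stats(labels, old_preds, new_preds))  # (3, 1, 0.333...)
```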