How Did the Model Change? Efficiently Assessing Machine Learning API Shifts

Authors: Lingjiao Chen, Matei Zaharia, James Zou

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we observe significant ML API shifts from 2020 to 2021 among 12 out of 36 applications using commercial APIs from Google, Microsoft, Amazon, and other providers. These real-world shifts include both improvements and reductions in accuracy. Extensive experiments show that MASA can estimate such API shifts more accurately than standard approaches given the same budget.
Researcher Affiliation | Academia | Lingjiao Chen, Matei Zaharia, James Y. Zou, Stanford University; {lingjiao,jamesz}@stanford.edu, matei@cs.stanford.edu
Pseudocode | Yes | Algorithm 1: MASA's ML API shift assessment algorithm. Input: ML API ŷ(·), query budget N, partitions D_{i,k}, p ∈ ℝ^{L×K}, C_o ∈ ℝ^{L×L}, and α > 0. Output: estimated ML API shift Ĉ ∈ ℝ^{L×L}. (A simplified sketch of this procedure appears after the table.)
Open Source Code | Yes | Our code and datasets are also released: https://github.com/lchen001/MASA
Open Datasets | Yes | We investigated twelve standard datasets across three different tasks, namely YELP (Dat, c), IMDB (Maas et al.), WAIMAI (Dat, b), and SHOP (Dat, a) for sentiment analysis; FER+ (Goodfellow et al., 2015), RAFDB (Li et al.), EXPW (Zhang et al.), and AFNET (Mollahosseini et al., 2019) for facial emotion recognition; and DIGIT (Dat, d), AMNIST (Becker et al., 2018), CMD (Warden, 2018), and FLUENT (Lugosch et al.) for speech recognition. [...] We also release our dataset of 1,224,278 samples annotated by commercial APIs on different dates as the first dataset and resource for studying ML API performance shifts.
Dataset Splits | No | The paper discusses a 'testing partition' and 'testing portion' for certain datasets, but it does not specify train/validation/test splits with percentages or counts, nor does the main text explicitly describe the use of a validation set.
Hardware Specification | Yes | All experiments were run on a machine with 2 E5-2690 v4 CPUs, 160 GB RAM, and a 500 GB disk, with Ubuntu 18.04 LTS as the OS.
Software Dependencies | Yes | Our code is implemented and tested in Python 3.7.
Experiment Setup | Yes | All experiments were averaged over 1500 runs. [...] In all tasks, we created partitions using difficulty levels induced by a cheap open-source model from GitHub. More details are in Appendix C. [...] The dataset is divided into 4 partitions based on (i) positive (+) or negative (−) true labels, and (ii) low (l) or high (h) quality score. (A small sketch of this partitioning follows below.)
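The 2x2 partitioning quoted in the Experiment Setup row (positive/negative true label crossed with low/high quality score) can be illustrated with a small helper like the one below. The function name make_partitions, the median threshold, and the 0/1 label encoding are assumptions made for illustration; the paper constructs its partitions from difficulty levels given by a cheap open-source proxy model, with details in its Appendix C.

```python
import numpy as np

def make_partitions(true_labels, quality_scores, threshold=None):
    """Split a binary sentiment dataset into 4 partitions:
    (+ / -) true label crossed with (l / h) quality score.

    true_labels    -> array of 0/1 labels (1 = positive), assumed encoding
    quality_scores -> per-sample score from a cheap proxy model (assumed input)
    threshold      -> cut-off between low and high; defaults to the median score
    """
    true_labels = np.asarray(true_labels)
    quality_scores = np.asarray(quality_scores)
    if threshold is None:
        threshold = np.median(quality_scores)

    partitions = {}
    for sign, label in [("+", 1), ("-", 0)]:
        for level, mask in [("l", quality_scores < threshold),
                            ("h", quality_scores >= threshold)]:
            # Store the indices of all samples with this label and quality level.
            partitions[f"{sign}{level}"] = np.where((true_labels == label) & mask)[0]
    return partitions
```

The resulting index sets can be passed as the partitions argument of a budgeted shift estimator such as the sketch below.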
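To make the inputs listed for Algorithm 1 concrete, the following is a minimal Python sketch of the general idea: estimate how an API's confusion matrix changed by spending a fixed query budget N across data partitions, with more queries going to partitions where the change is harder to pin down. The function names (estimate_api_shift, query_api, sample_partition), the two-phase exploration-then-reallocation rule, and the role of alpha as an exploration fraction are illustrative assumptions, not the paper's exact MASA procedure; the returned "shift" here is simply the difference of label-conditioned confusion matrices.

```python
import numpy as np

def estimate_api_shift(query_api, partitions, labels, C_old, budget, alpha=0.5, seed=0):
    """Stratified sketch of confusion-matrix-shift estimation under a query budget.

    query_api(idx) -> predicted label of the *new* API version for sample idx
    partitions     -> dict: partition id -> np.ndarray of sample indices
    labels         -> np.ndarray of true labels in {0, ..., L-1}
    C_old          -> L x L confusion matrix estimated on the old API version
    alpha          -> fraction of the budget spent on uniform exploration (assumed role)
    """
    rng = np.random.default_rng(seed)
    L = C_old.shape[0]
    counts = np.zeros((L, L))            # counts[true label, new API prediction]
    err = {}

    def sample_partition(idxs, n):
        # Query the new API on n random samples from one partition and tally results.
        picks = rng.choice(idxs, size=min(n, len(idxs)), replace=False)
        preds = np.array([query_api(i) for i in picks], dtype=int)
        for i, yhat in zip(picks, preds):
            counts[labels[i], yhat] += 1
        return picks, preds

    # Phase 1: spend roughly alpha * budget uniformly to gauge each partition's error rate.
    explore = max(1, int(alpha * budget) // len(partitions))
    spent = 0
    for pid, idxs in partitions.items():
        picks, preds = sample_partition(idxs, explore)
        err[pid] = float(np.mean(preds != labels[picks])) + 1e-6
        spent += len(picks)

    # Phase 2: allocate the remaining budget proportionally to observed error,
    # a crude proxy for "where the shift estimate is most uncertain".
    total_err = sum(err.values())
    for pid, idxs in partitions.items():
        sample_partition(idxs, int((budget - spent) * err[pid] / total_err))

    # Row-normalise to an estimated new confusion matrix and report the change.
    C_new = counts / counts.sum(axis=1, keepdims=True).clip(min=1)
    return C_new - C_old
```

In this simplified view, partitions with higher observed disagreement receive more of the remaining budget; that is the intuition, though not the exact criterion, behind allocating queries adaptively rather than uniformly under a fixed budget.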