How Did the Model Change? Efficiently Assessing Machine Learning API Shifts
Authors: Lingjiao Chen, Matei Zaharia, James Zou
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we observe significant ML API shifts from 2020 to 2021 among 12 out of 36 applications using commercial APIs from Google, Microsoft, Amazon, and other providers. These real-world shifts include both improvements and reductions in accuracy. Extensive experiments show that MASA can estimate such API shifts more accurately than standard approaches given the same budget. |
| Researcher Affiliation | Academia | Lingjiao Chen, Matei Zaharia, James Y. Zou Stanford University {lingjiao,jamesz}@stanford.edu matei@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1 MASA's ML API shift assessment algorithm. Input: ML API ŷ(·), query budget N, partitions D_{i,k}, p ∈ ℝ^{L×K}, Cᵒ ∈ ℝ^{L×L}, and α > 0. Output: Estimated ML API shift Ĉ ∈ ℝ^{L×L} |
| Open Source Code | Yes | Our code and datasets are also released 1. https://github.com/lchen001/MASA |
| Open Datasets | Yes | We investigated twelve standard datasets across three different tasks, namely, YELP (Dat, c), IMDB (Maas et al.), WAIMAI (Dat, b), SHOP (Dat, a) for sentiment analysis, FER+ (Goodfellow et al., 2015), RAFDB (Li et al.), EXPW (Zhang et al.), AFNET (Mollahosseini et al., 2019) for facial emotion recognition, and DIGIT (Dat, d), AMNIST (Becker et al., 2018), CMD (Warden, 2018), FLUENT (Lugosch et al.) for speech recognition. [...] We also release our dataset of 1,224,278 samples annotated by commercial APIs on different dates as the first dataset and resource for studying ML API performance shifts. |
| Dataset Splits | No | The paper discusses 'testing partition' and 'testing portion' for certain datasets, but does not specify train/validation/test dataset splits with percentages, counts, or explicit mentions of a validation set's use in the main text. |
| Hardware Specification | Yes | All experiments were run on a machine with 2 E5-2690 v4 CPUs, 160 GB RAM and 500 GB disk with Ubuntu 18.04 LTS as the OS. |
| Software Dependencies | Yes | Our code is implemented and tested in python 3.7. |
| Experiment Setup | Yes | All experiments were averaged over 1500 runs. [...] In all tasks, we created partitions using difficulty levels induced by a cheap open-source model from GitHub. More details are in Appendix C. [...] The dataset is divided into 4 partitions based on (i) positive (+) or negative (−) true labels, and (ii) low (l) or high (h) quality score. |
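The shift Ĉ ∈ ℝ^{L×L} estimated by Algorithm 1 has the shape of a confusion-style matrix between the API's old and new predictions. As a point of reference, the sketch below shows only the naive uniform-sampling baseline that MASA is compared against, not MASA itself (which allocates the query budget adaptively across the partitions D_{i,k}); the function name and toy data are illustrative assumptions, not from the paper.

```python
import numpy as np

def estimate_shift_matrix(old_preds, new_preds, num_labels):
    """Empirical shift matrix: C_hat[i, j] ≈ P(new label = j | old label = i).

    Naive baseline: query uniformly sampled inputs against both API
    versions and count label transitions. MASA (Algorithm 1) instead
    spreads the query budget N adaptively over data partitions.
    """
    C = np.zeros((num_labels, num_labels))
    for y_old, y_new in zip(old_preds, new_preds):
        C[y_old, y_new] += 1  # count transition old label -> new label
    # Normalize each row to a conditional distribution; empty rows stay zero.
    row_sums = C.sum(axis=1, keepdims=True)
    return np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

# Toy example: a 2-label sentiment API queried on the same six inputs
# in 2020 (old) and 2021 (new).
old = [0, 0, 1, 1, 1, 0]
new = [0, 1, 1, 1, 0, 0]
C_hat = estimate_shift_matrix(old, new, num_labels=2)
```

The diagonal of `C_hat` measures how often the API's label is unchanged; off-diagonal mass is the shift that MASA estimates more sample-efficiently under the same budget.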