A Zero-Positive Learning Approach for Diagnosing Software Performance Regressions

Authors: Mejbah Alam, Justin Gottschlich, Nesime Tatbul, Javier S. Turek, Tim Mattson, Abdullah Muzahid

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate AutoPerf's generality and efficacy against 3 types of performance regressions across 10 real performance bugs in 7 benchmark and open-source programs. On average, AutoPerf exhibits 4% profiling overhead and accurately diagnoses more performance bugs than prior state-of-the-art approaches. Thus far, AutoPerf has produced no false negatives.
Researcher Affiliation | Collaboration | Mejbah Alam (Intel Labs, mejbah.alam@intel.com); Justin Gottschlich (Intel Labs, justin.gottschlich@intel.com); Nesime Tatbul (Intel Labs and MIT, tatbul@csail.mit.edu); Javier Turek (Intel Labs, javier.turek@intel.com); Timothy Mattson (Intel Labs, timothy.g.mattson@intel.com); Abdullah Muzahid (Texas A&M University, abdullah.muzahid@tamu.edu)
Pseudocode | No | The paper describes the system components and their interactions but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement or a direct link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We used 7 programs with known performance defects from the PARSEC [17] and the Phoenix [57] benchmark suites. Additionally, we evaluated 3 open-source programs: Boost [2], Memcached [4], and MySQL [5].
Dataset Splits | No | The paper describes training on an 'old version' and testing on a 'new version', and mentions running each program 'n number of times', but it does not specify the standard dataset splits (e.g., 80/10/10 percentages or per-split sample counts) or cross-validation methodology typically associated with ML reproducibility.
Hardware Specification | Yes | We performed all experiments on a 12-core dual socket Intel Xeon® Scalable 8268 processor [3] with 32GB RAM.
Software Dependencies | No | The paper mentions 'PAPI to read hardware performance counter values [49]' (see the PAPI sketch after the table) and 'Keras with TensorFlow to implement autoencoders [19]' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Given two versions of a software program, AutoPerf first compares their performance. If a degradation is observed, then the cause is likely to lie within the functions that differ in the two versions. Hence, AutoPerf automatically annotates the modified functions in both versions of the program and collects their HWPC profiles. The data collected for the older version is used for zero-positive model training, whereas the data collected for the newer version is used for inferencing based on the trained model. AutoPerf uses an autoencoder neural network to model normal performance behavior of a function [60]. To scale with a large number of functions, training data for functions with similar performance signatures are clustered together using k-means clustering and a single autoencoder model per cluster is trained [35]. Performance regressions are identified by measuring the reconstruction error that results from testing the autoencoders with profile data from the new version of the program. If the error comes out to be sufficiently high, then the corresponding execution of the function is marked as a performance bug and its root cause is analyzed as the final step of the diagnosis. ... The t parameter controls the level of thresholding. For example, with t = 2, the threshold provides (approximately) a 95% confidence interval for the reconstruction error.
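
To make the quoted pipeline concrete, below is a minimal sketch of the zero-positive workflow it describes: cluster functions by performance signature with k-means, train one autoencoder per cluster on the old version's profiles, and flag new-version runs whose reconstruction error exceeds mean + t·std. The dense network topology, the hyperparameters, and the synthetic `old_profiles` stand-in for HWPC data are our assumptions for illustration, not the paper's settings; the paper states only that it uses Keras/TensorFlow autoencoders and k-means clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from tensorflow import keras

def build_autoencoder(n_features, latent_dim=4):
    # Small dense autoencoder; the paper does not quote its exact topology.
    inputs = keras.Input(shape=(n_features,))
    h = keras.layers.Dense(16, activation="relu")(inputs)
    z = keras.layers.Dense(latent_dim, activation="relu")(h)
    h = keras.layers.Dense(16, activation="relu")(z)
    outputs = keras.layers.Dense(n_features, activation="sigmoid")(h)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Stand-in HWPC profiles: function name -> (n_runs, n_counters) array,
# values normalized to [0, 1]. Real data would come from PAPI counters.
rng = np.random.default_rng(0)
old_profiles = {f"fn_{i}": rng.random((50, 8)) for i in range(6)}
k = 2  # number of k-means clusters (illustrative)

# Cluster functions whose performance signatures (mean profiles) are similar,
# so one autoencoder can model each cluster instead of one per function.
signatures = np.stack([runs.mean(axis=0) for runs in old_profiles.values()])
labels = KMeans(n_clusters=k, n_init=10).fit_predict(signatures)

models, thresholds = {}, {}
for c in range(k):
    members = [fn for fn, lbl in zip(old_profiles, labels) if lbl == c]
    X = np.vstack([old_profiles[fn] for fn in members])  # nominal runs only
    ae = build_autoencoder(X.shape[1])
    ae.fit(X, X, epochs=50, batch_size=32, verbose=0)  # zero-positive training
    err = np.mean((ae.predict(X, verbose=0) - X) ** 2, axis=1)
    # Threshold = mean + t * std of the training error; t = 2 corresponds to
    # roughly a 95% interval if errors are approximately Gaussian.
    thresholds[c] = err.mean() + 2.0 * err.std()
    models[c] = ae

def flag_regressions(cluster, new_runs):
    """Mark new-version executions whose reconstruction error is anomalous."""
    err = np.mean((models[cluster].predict(new_runs, verbose=0) - new_runs) ** 2, axis=1)
    return err > thresholds[cluster]

print(flag_regressions(0, rng.random((10, 8))))  # e.g., an array of booleans
```

Note that under this zero-positive framing only nominal (old-version) runs are needed for training, which is consistent with the Dataset Splits finding above: no labeled regression examples, and hence no conventional train/validation/test split, appear in the paper.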
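
Regarding the unversioned dependencies noted in the Software Dependencies row: the paper reads hardware performance counters through PAPI. As a hedged illustration only, the pypapi Python bindings expose a similar high-level interface; the events and workload below are placeholders on our part, not the counters AutoPerf actually samples.

```python
from pypapi import papi_high as high
from pypapi import events as papi_events

# Count retired instructions and L2 total cache misses around a region of
# interest -- a stand-in for the per-function HWPC profiles AutoPerf collects.
high.start_counters([papi_events.PAPI_TOT_INS, papi_events.PAPI_L2_TCM])

total = sum(i * i for i in range(1_000_000))  # placeholder workload

tot_ins, l2_tcm = high.stop_counters()  # one value per started event
print(f"instructions={tot_ins}, l2_misses={l2_tcm}")
```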