Revisiting the Evaluation of Deep Learning-Based Compiler Testing

Authors: Yongqiang Tian, Zhenyang Xu, Yiwen Dong, Chengnian Sun, Shing-Chi Cheung

IJCAI 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments with more than 1,500 CPU-hours demonstrate that the state-of-the-art DLGs fail to compete against such a simple baseline: 3 vs. 1,750 hang bugs, 1 vs. 34 distinct compiler crashes." |
| Researcher Affiliation | Academia | University of Waterloo; The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes the mutation operations and workflows conceptually and with figures, but it provides no formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | "We make Kitten publicly available at https://doi.org/10.5281/zenodo.7946825 to benefit future research on DLGs." |
| Open Datasets | Yes | "Following existing DLGs [Cummins et al., 2018; Liu et al., 2019], we constructed a dataset using all the C files of the testsuite of GCC 11.2." (see the dataset-collection sketch after the table) |
| Dataset Splits | No | The paper describes the dataset used to train the program generators (DLGs) and fed as input to Kitten, but it does not specify explicit training, validation, and test splits for that dataset. |
| Hardware Specification | No | "Each generator is deployed on a unique GPU virtual machine on a cloud platform with the same configuration." The paper reports CPU-hour and GPU-hour budgets but does not specify exact GPU models, CPU models, or cloud instance types. |
| Software Dependencies | No | The paper mentions GCC 11.2, ANTLR, LCOV, and Perses but does not give version numbers for all key software components (e.g., ANTLR, LCOV, or specific language/library versions). |
| Experiment Setup | Yes | "We choose a longer duration, i.e., 72 hours, for a comprehensive evaluation. ... we used 120 seconds as the timeout threshold." (see the timeout-harness sketch after the table) |