Revisiting the Evaluation of Deep Learning-Based Compiler Testing
Authors: Yongqiang Tian, Zhenyang Xu, Yiwen Dong, Chengnian Sun, Shing-Chi Cheung
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with more than 1,500 CPU-hours demonstrate that the state-of-the-art DLGs fail to compete against such a simple baseline: 3 v.s. 1,750 hang bugs, 1 v.s. 34 distinct compiler crashes. |
| Researcher Affiliation | Academia | ¹University of Waterloo; ²The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes the mutation operations and workflows conceptually and with figures, but it does not provide any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | We make Kitten publicly available at https://doi.org/10.5281/zenodo.7946825 to benefit future research on DLGs. |
| Open Datasets | Yes | Following existing DLGs [Cummins et al., 2018; Liu et al., 2019], we constructed a dataset using all the C files of the testsuite of GCC 11.2. |
| Dataset Splits | No | The paper describes the dataset used to train the program generators (DLGs) and as input for Kitten, but it does not specify explicit training, validation, and test splits for this dataset used in the reproduction of the experiment. |
| Hardware Specification | No | each generator is deployed on a unique GPU virtual machine on a cloud platform with the same configuration. The paper mentions 'CPU-hours' and 'GPU-hour' experiments but does not specify exact GPU models, CPU models, or cloud instance types. |
| Software Dependencies | No | The paper mentions 'GCC 11.2', 'Antlr', 'LCOV', and 'Perses' but does not provide specific version numbers for all key software components (e.g., Antlr, LCOV, or specific programming language versions/libraries). |
| Experiment Setup | Yes | We choose a longer duration, i.e. 72 hours, for a comprehensive evaluation. ... we used 120 seconds as the timeout threshold. |
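
The 120-second timeout in the Experiment Setup row suggests how a single compiler invocation might be labeled as a hang. The following is a minimal sketch of that idea, not the paper's or Kitten's actual harness: the helper name `classify_run`, the outcome labels, and the crash heuristic (death by signal, or GCC's "internal compiler error" message on stderr) are all assumptions made for illustration.

```python
import subprocess

# Timeout threshold quoted in the table above (120 seconds).
HANG_TIMEOUT_SECONDS = 120


def classify_run(cmd, timeout=HANG_TIMEOUT_SECONDS):
    """Run one compiler command and label the outcome.

    Hypothetical helper: the labels and the crash heuristic are
    assumptions, not taken from the paper.
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            errors="replace",  # tolerate non-UTF-8 diagnostics
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child is killed once the timeout elapses.
        return "hang"

    # Assumption: a crash appears either as termination by a signal
    # (negative return code on POSIX) or as GCC's ICE message on stderr.
    if proc.returncode < 0 or "internal compiler error" in proc.stderr:
        return "crash"

    # A non-zero exit without a crash signature is treated as an ordinary
    # rejection of the input (for example, a compile error).
    return "ok" if proc.returncode == 0 else "rejected"
```

A call such as `classify_run(["gcc", "-O2", "-c", "test.c"])` would then return one of the four labels. The file name and flags here are placeholders, not the configuration used in the paper.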