Revisiting the Evaluation of Deep Learning-Based Compiler Testing

Authors: Yongqiang Tian, Zhenyang Xu, Yiwen Dong, Chengnian Sun, Shing-Chi Cheung

IJCAI 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments with more than 1,500 CPU-hours demonstrate that the state-of-the-art DLGs fail to compete against such a simple baseline: 3 vs. 1,750 hang bugs, 1 vs. 34 distinct compiler crashes." |
| Researcher Affiliation | Academia | University of Waterloo; The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes the mutation operations and workflows conceptually and with figures, but it provides no formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | "We make Kitten publicly available at https://doi.org/10.5281/zenodo.7946825 to benefit future research on DLGs." |
| Open Datasets | Yes | "Following existing DLGs [Cummins et al., 2018; Liu et al., 2019], we constructed a dataset using all the C files of the testsuite of GCC 11.2." (see the dataset-collection sketch after the table) |
| Dataset Splits | No | The paper describes the dataset used to train the program generators (DLGs) and fed as input to Kitten, but it does not specify explicit training, validation, and test splits for that dataset. |
| Hardware Specification | No | "Each generator is deployed on a unique GPU virtual machine on a cloud platform with the same configuration." The paper reports CPU-hour and GPU-hour budgets but does not specify exact GPU models, CPU models, or cloud instance types. |
| Software Dependencies | No | The paper mentions GCC 11.2, ANTLR, LCOV, and Perses but does not give version numbers for all key software components (e.g., ANTLR, LCOV, or specific language/library versions). |
| Experiment Setup | Yes | "We choose a longer duration, i.e., 72 hours, for a comprehensive evaluation. ... we used 120 seconds as the timeout threshold." (see the timeout-harness sketch after the table) |