Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

HEAT: Hyperedge Attention Networks

Authors: Dobrik Georgiev Georgiev, Marc Brockschmidt, Miltiadis Allamanis

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate HEAT on two tasks from the literature: bug detection and repair (Allamanis et al., 2021) and knowledge base completion (Galkin et al., 2020). In both settings, it outperforms strong baselines, indicating its power and generality. Results: We show the results of our experiments in Table 1, where Loc. refers to the accuracy in identifying the buggy location in an input program, Repair to the accuracy in determining the correct fix given the buggy location, and Joint to solving both tasks together. The results indicate that HEAT improves performance on both considered datasets, improving the joint localisation and repair accuracy by 10% over the two well-tuned baselines.
Researcher Affiliation | Collaboration | Dobrik Georgiev (EMAIL), Department of Computer Science and Technology, University of Cambridge, UK; Marc Brockschmidt (EMAIL), Microsoft Research, Cambridge, UK; Miltiadis Allamanis (EMAIL), Microsoft Research, Cambridge, UK
Pseudocode | Yes |
Algorithm 1: Greedy Hyperedge Packing into Microbatches
    buckets ← []
    for h in sortedDescendingByWidth(hyperedges) do
        wasAdded ← False
        for bucket in buckets do
            if bucket.remainingSize ≥ h.width then
                bucket.add(h)
                wasAdded ← True
                break
        if not wasAdded then
            bucketSize ← smallestFittingMicrobatchWidth(h.width)
            newBucket ← createBucket(bucketSize)
            newBucket.add(h)
            buckets.append(newBucket)
    return groupBucketsToMicrobatches(buckets)
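The packing routine above can be sketched in runnable Python. This is a first-fit-decreasing heuristic over hyperedge widths; the class names, the `smallest_fitting_width` helper, and the candidate microbatch widths are illustrative assumptions rather than details from the released code, and the final `groupBucketsToMicrobatches` step is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Hyperedge:
    width: int  # number of slots this hyperedge occupies in a microbatch

@dataclass
class Bucket:
    capacity: int
    edges: list = field(default_factory=list)

    @property
    def remaining_size(self) -> int:
        return self.capacity - sum(e.width for e in self.edges)

    def add(self, e: Hyperedge) -> None:
        self.edges.append(e)

def smallest_fitting_width(width, sizes=(8, 16, 32, 64, 128)):
    # Pick the smallest predefined microbatch width that fits the edge
    # (the candidate sizes here are an assumption for illustration).
    for s in sizes:
        if s >= width:
            return s
    return width

def pack_hyperedges(hyperedges):
    buckets = []
    # Largest hyperedges first, then first-fit into existing buckets.
    for h in sorted(hyperedges, key=lambda e: e.width, reverse=True):
        for bucket in buckets:
            if bucket.remaining_size >= h.width:
                bucket.add(h)
                break
        else:
            # No existing bucket fits: open a new one sized to this edge.
            new_bucket = Bucket(smallest_fitting_width(h.width))
            new_bucket.add(h)
            buckets.append(new_bucket)
    return buckets
```

Sorting by decreasing width before first-fit is what keeps wide hyperedges from forcing many underfilled buckets late in the loop.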
Open Source Code | Yes | Our implementation of the HEAT model is available on the heat branch of https://github.com/microsoft/neurips21-self-supervised-bug-detection-and-repair/tree/heat. This includes code for the extraction of hypergraph representations of Python code, as discussed in Sec. 3.
Open Datasets | Yes | Dataset: We use the code of Allamanis et al. (2021) to generate a dataset of randomly inserted bugs to train and evaluate a neural network in a supervised fashion. Consequently, we obtain a new variant of the Random Bugs test dataset, consisting of 760k graphs. We additionally re-extract the PyPIBugs dataset with the provided script, generating hypergraphs as consumed by HEAT, and graphs generated by the baseline models. Dataset: Following the discovery of test leaks and design flaws by Galkin et al. (2020) in common benchmark datasets such as WikiPeople (Guan et al., 2019) and JF17K (Wen et al., 2016), we chose one of the variations of the new WD50K dataset presented there: WD50K (100).
Dataset Splits | No | Dataset: We use the code of Allamanis et al. (2021) to generate a dataset of randomly inserted bugs to train and evaluate a neural network in a supervised fashion. Consequently, we obtain a new variant of the Random Bugs test dataset, consisting of 760k graphs. We additionally re-extract the PyPIBugs dataset with the provided script, generating hypergraphs as consumed by HEAT, and graphs generated by the baseline models. Training and evaluation: Training is performed as in Galkin et al. (2020) using binary cross entropy with label smoothing. We trained our model for 1k epochs with a learning rate of 0.0004 and batch size of 512. Hyperparameters were fine-tuned manually, using the provided validation set in the StarE implementation. (Explanation: The paper mentions using code to generate datasets for training and evaluation, and fine-tuning hyperparameters using a validation set. However, it does not provide specific percentages, counts, or methodology for how the datasets are split into training, validation, and test sets, which is necessary for reproducibility.)
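The quoted loss, binary cross entropy with label smoothing, can be sketched for a single prediction as follows. The smoothing form used here (pulling the hard 0/1 target toward 0.5 by eps) is one common variant and an assumption, since the paper defers to Galkin et al. (2020) for the exact formulation.

```python
import math

def bce_label_smoothed(p: float, y: float, eps: float = 0.1) -> float:
    """Binary cross entropy for one prediction p in (0, 1) against a
    hard target y in {0, 1}, with the target smoothed toward 0.5 by eps
    (an assumed smoothing scheme, not taken from the paper)."""
    y_s = y * (1.0 - eps) + 0.5 * eps  # e.g. y=1, eps=0.1 -> 0.95
    return -(y_s * math.log(p) + (1.0 - y_s) * math.log(1.0 - p))
```

With eps = 0 this reduces to plain binary cross entropy; with eps > 0 even a perfectly confident correct prediction keeps a nonzero loss, which discourages overconfident scores.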
Hardware Specification | No | On the other hand, processing each hyperedge separately would not make use of parallel computation in modern GPUs. This process is performed in-CPU during the minibatch preparation. (Explanation: The paper mentions "modern GPUs" and "in-CPU" for computation, but does not provide specific models or specifications for the hardware used in the experiments.)
Software Dependencies | No | We implemented it as a PyTorch (Paszke et al., 2019) Module, available on the heat branch of https://github.com/microsoft/neurips21-self-supervised-bug-detection-and-repair/tree/heat. (Explanation: The paper mentions "PyTorch" as the framework used, but does not specify a version number, nor does it list any other software dependencies with their versions.)
Experiment Setup | Yes | Model Architecture: We modify the architecture of Allamanis et al. (2021) to use 6 HEAT layers with a hidden dimension of 256, 8 heads, feed-forward (FFN in Eq. 4) hidden layer of 2048, and dropout rate of 0.1. We used a single layer of HEAT with embedding size of 100 and a single layer of the Transformer used for calculating the final predictions. We trained our model for 1k epochs with a learning rate of 0.0004 and batch size of 512.
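The quoted hyperparameters can be collected into a configuration sketch. The field names and the grouping into two dataclasses are illustrative assumptions and do not mirror the released code; the values are those quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HEATConfig:
    """Architecture hyperparameters quoted for the bug localisation/repair setup."""
    num_layers: int = 6         # "6 HEAT layers"
    hidden_dim: int = 256       # "hidden dimension of 256"
    num_heads: int = 8          # "8 heads"
    ffn_hidden_dim: int = 2048  # "feed-forward (FFN in Eq. 4) hidden layer of 2048"
    dropout: float = 0.1        # "dropout rate of 0.1"

@dataclass(frozen=True)
class TrainConfig:
    """Training hyperparameters quoted in the experiment setup."""
    epochs: int = 1000          # "1k epochs"
    learning_rate: float = 4e-4
    batch_size: int = 512
```

One consistency check worth noting: the hidden dimension divides evenly by the number of heads (256 / 8 = 32 per head), as multi-head attention requires.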