Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

HEAT: Hyperedge Attention Networks

Authors: Dobrik Georgiev Georgiev, Marc Brockschmidt, Miltiadis Allamanis

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate HEAT on two tasks from the literature: bug detection and repair (Allamanis et al., 2021) and knowledge base completion (Galkin et al., 2020). In both settings, it outperforms strong baselines, indicating its power and generality. Results: We show the results of our experiments in Table 1, where Loc. refers to the accuracy in identifying the buggy location in an input program, Repair to the accuracy in determining the correct fix given the buggy location, and Joint to solving both tasks together. The results indicate that HEAT improves performance on both considered datasets, improving the joint localisation and repair accuracy by 10% over the two well-tuned baselines.
Researcher Affiliation | Collaboration | Dobrik Georgiev (EMAIL), Department of Computer Science and Technology, University of Cambridge, UK; Marc Brockschmidt (EMAIL), Microsoft Research, Cambridge, UK; Miltiadis Allamanis (EMAIL), Microsoft Research, Cambridge, UK
Pseudocode | Yes |
Algorithm 1: Greedy Hyperedge Packing into Microbatches
    buckets ← []
    for h in sortedDescendingByWidth(hyperedges) do
        wasAdded ← False
        for bucket in buckets do
            if bucket.remainingSize ≥ h.width then
                bucket.add(h)
                wasAdded ← True
                break
        if not wasAdded then
            bucketSize ← smallestFittingMicrobatchWidth(h.width)
            newBucket ← createBucket(bucketSize)
            newBucket.add(h)
            buckets.append(newBucket)
    return groupBucketsToMicrobatches(buckets)
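The packing routine above can be sketched in runnable Python. This is a first-fit-decreasing heuristic over hyperedge widths; the class names, the `smallest_fitting_width` helper, and the candidate microbatch widths are illustrative assumptions rather than details from the released code, and the final `groupBucketsToMicrobatches` step is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Hyperedge:
    width: int  # number of slots this hyperedge occupies in a microbatch

@dataclass
class Bucket:
    capacity: int
    edges: list = field(default_factory=list)

    @property
    def remaining_size(self) -> int:
        return self.capacity - sum(e.width for e in self.edges)

    def add(self, e: Hyperedge) -> None:
        self.edges.append(e)

def smallest_fitting_width(width, sizes=(8, 16, 32, 64, 128)):
    # Pick the smallest predefined microbatch width that fits the edge
    # (the candidate sizes here are an assumption for illustration).
    for s in sizes:
        if s >= width:
            return s
    return width

def pack_hyperedges(hyperedges):
    buckets = []
    # Largest hyperedges first, then first-fit into existing buckets.
    for h in sorted(hyperedges, key=lambda e: e.width, reverse=True):
        for bucket in buckets:
            if bucket.remaining_size >= h.width:
                bucket.add(h)
                break
        else:
            # No existing bucket fits: open a new one sized to this edge.
            new_bucket = Bucket(smallest_fitting_width(h.width))
            new_bucket.add(h)
            buckets.append(new_bucket)
    return buckets
```

Sorting by decreasing width before first-fit is what keeps wide hyperedges from forcing many underfilled buckets late in the loop.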
Open Source Code | Yes | Our implementation of the HEAT model is available on the heat branch of https://github.com/microsoft/neurips21-self-supervised-bug-detection-and-repair/tree/heat. This includes code for the extraction of hypergraph representations of Python code, as discussed in Sec. 3.
Open Datasets | Yes | Dataset: We use the code of Allamanis et al. (2021) to generate a dataset of randomly inserted bugs to train and evaluate a neural network in a supervised fashion. Consequently, we obtain a new variant of the Random Bugs test dataset, consisting of 760k graphs. We additionally re-extract the PyPIBugs dataset with the provided script, generating hypergraphs as consumed by HEAT, and graphs generated by the baseline models. Dataset: Following the discovery of test leaks and design flaws by Galkin et al. (2020) in common benchmark datasets such as WikiPeople (Guan et al., 2019) and JF17K (Wen et al., 2016), we chose one of the variations of the new WD50K dataset presented there: WD50K (100).
Dataset Splits | No | Dataset: We use the code of Allamanis et al. (2021) to generate a dataset of randomly inserted bugs to train and evaluate a neural network in a supervised fashion. Consequently, we obtain a new variant of the Random Bugs test dataset, consisting of 760k graphs. We additionally re-extract the PyPIBugs dataset with the provided script, generating hypergraphs as consumed by HEAT, and graphs generated by the baseline models. Training and evaluation: Training is performed as in Galkin et al. (2020) using binary cross entropy with label smoothing. We trained our model for 1k epochs with a learning rate of 0.0004 and batch size of 512. Hyperparameters were fine-tuned manually, using the provided validation set in the StarE implementation. (Explanation: The paper mentions using code to generate datasets for training and evaluation, and fine-tuning hyperparameters using a validation set. However, it does not provide specific percentages, counts, or methodology for how the datasets are split into training, validation, and test sets, which is necessary for reproducibility.)
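The quoted loss, binary cross entropy with label smoothing, can be sketched for a single prediction as follows. The smoothing form used here (pulling the hard 0/1 target toward 0.5 by eps) is one common variant and an assumption, since the paper defers to Galkin et al. (2020) for the exact formulation.

```python
import math

def bce_label_smoothed(p: float, y: float, eps: float = 0.1) -> float:
    """Binary cross entropy for one prediction p in (0, 1) against a
    hard target y in {0, 1}, with the target smoothed toward 0.5 by eps
    (an assumed smoothing scheme, not taken from the paper)."""
    y_s = y * (1.0 - eps) + 0.5 * eps  # e.g. y=1, eps=0.1 -> 0.95
    return -(y_s * math.log(p) + (1.0 - y_s) * math.log(1.0 - p))
```

With eps = 0 this reduces to plain binary cross entropy; with eps > 0 even a perfectly confident correct prediction keeps a nonzero loss, which discourages overconfident scores.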
Hardware Specification | No | On the other hand, processing each hyperedge separately would not make use of parallel computation in modern GPUs. This process is performed in-CPU during the minibatch preparation. (Explanation: The paper mentions "modern GPUs" and "in-CPU" for computation, but does not provide specific models or specifications for the hardware used in the experiments.)
Software Dependencies | No | We implemented it as a PyTorch (Paszke et al., 2019) Module, available on the heat branch of https://github.com/microsoft/neurips21-self-supervised-bug-detection-and-repair/tree/heat. (Explanation: The paper mentions "PyTorch" as the framework used, but does not specify a version number, nor does it list any other software dependencies with their versions.)
Experiment Setup | Yes | Model Architecture: We modify the architecture of Allamanis et al. (2021) to use 6 HEAT layers with a hidden dimension of 256, 8 heads, feed-forward (FFN in Eq. 4) hidden layer of 2048, and dropout rate of 0.1. We used a single layer of HEAT with embedding size of 100 and a single layer of the Transformer used for calculating the final predictions. We trained our model for 1k epochs with a learning rate of 0.0004 and batch size of 512.
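The quoted hyperparameters can be collected into a configuration sketch. The field names and the grouping into two dataclasses are illustrative assumptions and do not mirror the released code; the values are those quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HEATConfig:
    """Architecture hyperparameters quoted for the bug localisation/repair setup."""
    num_layers: int = 6         # "6 HEAT layers"
    hidden_dim: int = 256       # "hidden dimension of 256"
    num_heads: int = 8          # "8 heads"
    ffn_hidden_dim: int = 2048  # "feed-forward (FFN in Eq. 4) hidden layer of 2048"
    dropout: float = 0.1        # "dropout rate of 0.1"

@dataclass(frozen=True)
class TrainConfig:
    """Training hyperparameters quoted in the experiment setup."""
    epochs: int = 1000          # "1k epochs"
    learning_rate: float = 4e-4
    batch_size: int = 512
```

One consistency check worth noting: the hidden dimension divides evenly by the number of heads (256 / 8 = 32 per head), as multi-head attention requires.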