Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Structural Language Models of Code
Authors: Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav
ICML 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We evaluate SLMs on Java any-code completion, achieving a new state of the art: exact-match accuracy@1 of 18.04% and accuracy@5 of 24.83%...", "4. Experimental Setup", "Table 1. Results on any-code completion in Java.", "6. Ablation Study" |
| Researcher Affiliation | Collaboration | ¹Technion, Israel; ²Tel Aviv University; ³Facebook AI Research. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks labeled as such. |
| Open Source Code | Yes | Our code, data, and trained models are available at http://github.com/tech-srl/slm-code-generation/. |
| Open Datasets | Yes | We take the Java-small dataset of Alon et al. (2019a), which is a re-split of the dataset of Allamanis et al. (2016). [...] extracted examples from the raw dataset of Allamanis et al. (2018) using their unseen projects test set. |
| Dataset Splits | Yes | Ultimately, this dataset contains 1.3M/10k/20k train/dev/test examples. This dataset contains 16k/8k/3k train/dev/test examples. |
| Hardware Specification | Yes | We train the model end-to-end on a single V100 GPU, using cross entropy and the Adam optimizer (Kingma & Ba, 2015), an initial learning rate of 10⁻⁴ multiplied by 0.95 every 20k steps. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and 'OpenNMT' for baselines, but does not provide version numbers for the key software dependencies (e.g., Python, PyTorch/TensorFlow, or other libraries) used to implement its own model. |
| Experiment Setup | Yes | We use embeddings of size 512, 2 layers of LSTMs with 256 units, and 4 transformer layers with 8 attention heads. [...] initial learning rate of 10⁻⁴ multiplied by 0.95 every 20k steps. [...] vary the batch size such that each batch contains about 512 targets. [...] apply dropout of 0.25 in the Transformer layers, and a recurrent dropout of 0.5 in the LSTMs. |
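
The experiment setup quoted in the table can be collected into a small configuration, which also makes the learning-rate arithmetic concrete. The following is only a minimal sketch, not the authors' code: the names `ReportedSetup` and `learning_rate` are hypothetical, and it assumes the decay is applied as a staircase (one multiplication by 0.95 after each full block of 20k steps), which the quoted text suggests but does not state outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters as quoted in the table (field names are hypothetical)."""
    embedding_size: int = 512
    lstm_layers: int = 2
    lstm_units: int = 256
    transformer_layers: int = 4
    attention_heads: int = 8
    transformer_dropout: float = 0.25
    lstm_recurrent_dropout: float = 0.5
    targets_per_batch: int = 512      # "about 512 targets" per batch
    initial_lr: float = 1e-4
    lr_decay: float = 0.95
    lr_decay_every: int = 20_000      # steps between decays


def learning_rate(step: int, setup: ReportedSetup = ReportedSetup()) -> float:
    """Learning rate at a given training step.

    Assumption: staircase decay, i.e. the rate is multiplied by
    `lr_decay` once per completed block of `lr_decay_every` steps.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    completed_blocks = step // setup.lr_decay_every
    return setup.initial_lr * setup.lr_decay ** completed_blocks


if __name__ == "__main__":
    for step in (0, 20_000, 100_000, 500_000):
        print(f"step {step:>7}: lr = {learning_rate(step):.3e}")
```

Under this reading the rate stays at 10⁻⁴ for the first 20k steps, drops to 9.5 × 10⁻⁵ at step 20k, and shrinks geometrically from there. If the decay were instead applied continuously, the exponent would be `step / lr_decay_every` rather than the integer quotient.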