Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks

Authors: Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference.
Researcher Affiliation | Collaboration | ¹Department of EECS, University of California at Berkeley, Berkeley, California, USA; ²AI at Meta, USA.
Pseudocode | No | The paper includes code examples in figures but does not contain pseudocode or explicitly labeled algorithm blocks.
Open Source Code | Yes | The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim.
Open Datasets | Yes | The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim.
Dataset Splits | No | The paper does not provide training/validation/test splits for its own benchmark; the entire SAFIM dataset is used for evaluating the LLMs.
Hardware Specification | No | The paper mentions using the OpenAI API and the Hugging Face transformers library for generation, but does not specify the hardware used for these operations or for the authors' own experiments beyond general mentions of "computational resources".
Software Dependencies | No | The paper mentions using the OpenAI API and the Hugging Face transformers library, but does not provide version numbers for these software dependencies.
Experiment Setup | Yes | For the remaining models, generation is conducted via the Hugging Face transformers library, following established practices in Fried et al. (2023): top-p random sampling with p = 0.95 and a temperature of 0.2 (see the generation sketch below this table).
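
To make the reported decoding settings concrete, here is a minimal sketch of sampling-based generation with the Hugging Face transformers library using the quoted parameters (top-p = 0.95, temperature = 0.2). This is not the authors' harness: the checkpoint, prompt, and token budget below are placeholder assumptions for illustration only.

# Minimal sketch (assumptions noted inline), not the authors' evaluation harness.
# It only demonstrates the decoding settings quoted in the table above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # placeholder checkpoint, chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def add(a, b):\n    return "  # placeholder completion-style prompt
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,      # enables top-p random sampling
    top_p=0.95,          # nucleus sampling threshold from the paper's setup
    temperature=0.2,     # temperature from the paper's setup
    max_new_tokens=64,   # length budget is an assumption, not stated in the table
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Model-specific FIM prompt construction and output post-processing are handled by the released toolkit at the repository linked above.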