CODE REPRESENTATION LEARNING AT SCALE

Authors: Dejiao Zhang, Wasi Uddin Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang

ICLR 2024

Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining schemes determine how downstream task performance scales with model size.
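The ablations above revolve around a contrastive objective over paired code and text embeddings. Below is a minimal PyTorch sketch of a symmetric, InfoNCE-style bimodal contrastive loss with in-batch negatives; it illustrates the general technique rather than the paper's exact objective, and the function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(code_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of (code, docstring) embedding pairs.

    Matching pairs sit on the diagonal of the similarity matrix; all other
    in-batch examples act as negatives, the hardest being the non-matching
    ones with the highest similarity.
    """
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss covers both code-to-text and text-to-code retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Random embeddings stand in for encoder outputs.
loss = bimodal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```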
Researcher Affiliation: Industry. Dejiao Zhang & Wasi Uddin Ahmad {dejiaoz,wuahmad}@amazon.com; Ming Tan & Hantian Ding {mingtan,dhantian}@amazon.com; Ramesh Nallapati, Dan Roth, Xiaofei Ma & Bing Xiang {rnallapa,drot,xiaofeim,bxiang}@amazon.com; AWS AI Labs.
Pseudocode: No. The paper describes its algorithms and methods using mathematical equations and textual descriptions, but it does not include structured pseudocode or clearly labeled algorithm blocks.
Open Source Code: Yes. Code and models can be found at https://code-representation-learning.github.io/.
Open Datasets: Yes. We train our models on The Stack dataset (Kocetkov et al., 2022) over nine languages: Python, Java, JavaScript, TypeScript, C#, C, Ruby, Go, and PHP.
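The Stack is hosted on the Hugging Face Hub with one directory per language, so a per-language subset can be streamed rather than downloaded in full. The sketch below assumes the public bigcode/the-stack repository layout and the datasets library; access to the dataset is gated, and the paper's exact preprocessing is not specified.

```python
from datasets import load_dataset

# One directory per language under data/; exact directory names follow the
# Hub listing (e.g. the C# folder may be spelled differently).
stack_python = load_dataset(
    "bigcode/the-stack",      # gated: requires accepting the terms and an HF token
    data_dir="data/python",
    split="train",
    streaming=True,           # stream instead of downloading the full corpus
)

for example in stack_python.take(3):
    print(example["content"][:200])
```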
Dataset Splits: Yes. We present the label distribution for the Runtime Error prediction dataset in Table 10. ... Table 10: Distribution of target classes in the Python Runtime Errors dataset. Target Class (Train / Valid / Test): No error 120,503 / 13,049 / 13,745; Import Error 259 / 37 / 22; ...
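Only two of the target classes are quoted above; the short sketch below plugs those quoted counts in to show how skewed the label distribution is in each split.

```python
# Counts quoted from the Table 10 excerpt (remaining classes omitted).
table10 = {
    "No error":     {"train": 120_503, "valid": 13_049, "test": 13_745},
    "Import Error": {"train": 259,     "valid": 37,     "test": 22},
}

for split in ("train", "valid", "test"):
    total = sum(per_class[split] for per_class in table10.values())
    majority = table10["No error"][split] / total
    print(f"{split}: {total} examples listed, {majority:.1%} labelled 'No error'")
```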
Hardware Specification: No. The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies: No. The paper mentions tools such as tree-sitter and the StarCoder tokenizer, but it does not provide version numbers for any of the software dependencies or libraries used in the experiments.
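Because no versions are stated, a reproduction has to pin its own environment. The sketch below uses placeholder versions and shows how the two named tools might be pulled in; the StarCoder tokenizer repository is gated, and tree-sitter parsing additionally needs a compiled grammar per language, which the paper does not detail.

```python
# Hypothetical pins (the paper gives no versions):
#   tree_sitter==0.20.4
#   transformers==4.33.0
from tree_sitter import Parser            # parsing requires a separately built grammar
from transformers import AutoTokenizer

# StarCoder tokenizer for subword tokenization (gated repo; needs an HF token).
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
print(tokenizer.tokenize("def add(a, b):\n    return a + b"))
```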
Experiment Setup: Yes. We summarize the model hyper-parameters in Table 5. ... Table 5: Model architecture and pre-training related hyper-parameters. Stage 1, Masked Language Modeling: Dropout 0.1; Max steps 250,000; Warmup steps 5,000; Batch size 2048; Base learning rate 3e-4. ... We present the hyper-parameters used while fine-tuning models for code classification tasks in Table 11. ... Table 11: Hyper-parameters for fine-tuning baseline models and CODESAGE on code classification tasks. Optimizer AdamW; Learning rate (LR) 1e-3; Batch size 32; Epochs 10.
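A small sketch collecting the quoted hyper-parameters into configuration objects; the class and field names are ours, while the values are taken verbatim from the Table 5 and Table 11 excerpts.

```python
from dataclasses import dataclass

@dataclass
class Stage1PretrainConfig:
    """Stage 1 (masked language modeling) settings quoted from Table 5."""
    dropout: float = 0.1
    max_steps: int = 250_000
    warmup_steps: int = 5_000
    batch_size: int = 2048
    base_learning_rate: float = 3e-4

@dataclass
class ClassificationFinetuneConfig:
    """Code-classification fine-tuning settings quoted from Table 11."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-3
    batch_size: int = 32
    num_epochs: int = 10

print(Stage1PretrainConfig())
print(ClassificationFinetuneConfig())
```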