CODE REPRESENTATION LEARNING AT SCALE
Authors: Dejiao Zhang, Wasi Uddin Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts the cross-lingual semantic search performance; and (iv) how the pretraining schemes decide how the downstream task performance scales with the model size. (A hedged sketch of such a bimodal contrastive objective appears after this table.) |
| Researcher Affiliation | Industry | Dejiao Zhang & Wasi Uddin Ahmad {dejiaoz,wuahmad}@amazon.com Ming Tan & Hantian Ding {mingtan,dhantian}@amazon.com Ramesh Nallapati & Dan Roth & Xiaofei Ma & Bing Xiang {rnallapa,drot,xiaofeim,bxiang}@amazon.com AWS AI Labs |
| Pseudocode | No | The paper describes the algorithms and methods using mathematical equations and textual descriptions, but it does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code and models can be found at https://code-representation-learning.github.io/. |
| Open Datasets | Yes | We train our models on The Stack dataset (Kocetkov et al., 2022) over nine languages: Python, Java, JavaScript, TypeScript, C#, C, Ruby, Go, and PHP. |
| Dataset Splits | Yes | We present the label distribution for the Runtime Error prediction dataset in Table 10. ... Table 10: Distribution of target classes in the Python Runtime Errors dataset (Target Class: Train / Valid / Test). No error: 120,503 / 13,049 / 13,745; Import Error: 259 / 37 / 22; ... |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU specifications, or memory. |
| Software Dependencies | No | The paper mentions tools like 'tree-sitter' and the 'StarCoder tokenizer' but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We summarize the model hyper-parameters in Table 5. ... Table 5: Model architecture and pre-training related hyper-parameters. Stage 1 (Masked Language Modeling): dropout 0.1, max steps 250,000, warmup steps 5000, batch size 2048, base learning rate 3e-4. ... We present the hyper-parameters that we used while fine-tuning models for code classification tasks in Table 11. ... Table 11: Hyperparameters for fine-tuning baseline models and CODESAGE on code classification tasks: optimizer AdamW, learning rate (LR) 1e-3, batch size 32, 10 epochs. (A hedged fine-tuning sketch based on these settings appears after this table.) |
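The bimodal contrastive learning quoted in the Research Type row pairs code with its natural-language counterpart and contrasts matched pairs against in-batch negatives. The snippet below is a minimal sketch of a symmetric InfoNCE-style objective of that kind; the function name, the temperature value, and the omission of the paper's hard-negative and hard-positive handling are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def bimodal_info_nce(code_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over (code, summary) pairs with in-batch negatives.

    Row i of `code_emb` is assumed to be paired with row i of `text_emb`;
    every other row in the batch serves as a negative.
    """
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # Average the code -> text and text -> code retrieval losses.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

The hard negatives and hard positives that the paper reports as important would change how the off-diagonal and diagonal terms above are weighted or constructed; this sketch only shows the plain in-batch variant.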
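The fine-tuning settings quoted in the Experiment Setup row (AdamW, learning rate 1e-3, batch size 32, 10 epochs) imply a standard classification loop like the sketch below. The `encoder`, `classifier_head`, and `train_dataset` objects are placeholders introduced here for illustration; the authors' released code at the URL above is the authoritative reference.

```python
import torch
from torch.utils.data import DataLoader

def finetune_classifier(encoder: torch.nn.Module,
                        classifier_head: torch.nn.Module,
                        train_dataset,
                        epochs: int = 10,
                        lr: float = 1e-3,
                        batch_size: int = 32) -> torch.nn.Module:
    """Fine-tune an encoder plus classification head with the Table 11 settings."""
    model = torch.nn.Sequential(encoder, classifier_head)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            logits = model(inputs)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
    return model
```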