CODE REPRESENTATION LEARNING AT SCALE

Authors: Dejiao Zhang, Wasi Uddin Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati, Dan Roth, Xiaofei Ma, Bing Xiang

ICLR 2024

Each entry below lists a reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts cross-lingual semantic search performance; and (iv) how the pretraining schemes determine how downstream task performance scales with model size.
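The ablations above revolve around a contrastive objective over paired code and text embeddings. Below is a minimal PyTorch sketch of a symmetric, InfoNCE-style bimodal contrastive loss with in-batch negatives; it illustrates the general technique rather than the paper's exact objective, and the function name and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(code_emb: torch.Tensor,
                             text_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of (code, docstring) embedding pairs.

    Matching pairs sit on the diagonal of the similarity matrix; all other
    in-batch examples act as negatives, the hardest being the non-matching
    ones with the highest similarity.
    """
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss covers both code-to-text and text-to-code retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Random embeddings stand in for encoder outputs.
loss = bimodal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```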
Researcher Affiliation: Industry. Dejiao Zhang & Wasi Uddin Ahmad {dejiaoz,wuahmad}@amazon.com; Ming Tan & Hantian Ding {mingtan,dhantian}@amazon.com; Ramesh Nallapati, Dan Roth, Xiaofei Ma & Bing Xiang {rnallapa,drot,xiaofeim,bxiang}@amazon.com; AWS AI Labs.
Pseudocode: No. The paper describes its algorithms and methods using mathematical equations and textual descriptions, but it does not include structured pseudocode or clearly labeled algorithm blocks.
Open Source Code: Yes. Code and models can be found at https://code-representation-learning.github.io/.
Open Datasets: Yes. We train our models on The Stack dataset (Kocetkov et al., 2022) over nine languages: Python, Java, JavaScript, TypeScript, C#, C, Ruby, Go, and PHP.
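The Stack is hosted on the Hugging Face Hub with one directory per language, so a per-language subset can be streamed rather than downloaded in full. The sketch below assumes the public bigcode/the-stack repository layout and the datasets library; access to the dataset is gated, and the paper's exact preprocessing is not specified.

```python
from datasets import load_dataset

# One directory per language under data/; exact directory names follow the
# Hub listing (e.g. the C# folder may be spelled differently).
stack_python = load_dataset(
    "bigcode/the-stack",      # gated: requires accepting the terms and an HF token
    data_dir="data/python",
    split="train",
    streaming=True,           # stream instead of downloading the full corpus
)

for example in stack_python.take(3):
    print(example["content"][:200])
```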
Dataset Splits: Yes. We present the label distribution for the Runtime Error prediction dataset in Table 10. ... Table 10: Distribution of target classes in the Python Runtime Errors dataset. Target Class (Train / Valid / Test): No error 120,503 / 13,049 / 13,745; Import Error 259 / 37 / 22; ...
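Only two of the target classes are quoted above; the short sketch below plugs those quoted counts in to show how skewed the label distribution is in each split.

```python
# Counts quoted from the Table 10 excerpt (remaining classes omitted).
table10 = {
    "No error":     {"train": 120_503, "valid": 13_049, "test": 13_745},
    "Import Error": {"train": 259,     "valid": 37,     "test": 22},
}

for split in ("train", "valid", "test"):
    total = sum(per_class[split] for per_class in table10.values())
    majority = table10["No error"][split] / total
    print(f"{split}: {total} examples listed, {majority:.1%} labelled 'No error'")
```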
Hardware Specification: No. The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU specifications, or memory.
Software Dependencies: No. The paper mentions tools such as tree-sitter and the StarCoder tokenizer, but it does not provide version numbers for any of the software dependencies or libraries used in the experiments.
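Because no versions are stated, a reproduction has to pin its own environment. The sketch below uses placeholder versions and shows how the two named tools might be pulled in; the StarCoder tokenizer repository is gated, and tree-sitter parsing additionally needs a compiled grammar per language, which the paper does not detail.

```python
# Hypothetical pins (the paper gives no versions):
#   tree_sitter==0.20.4
#   transformers==4.33.0
from tree_sitter import Parser            # parsing requires a separately built grammar
from transformers import AutoTokenizer

# StarCoder tokenizer for subword tokenization (gated repo; needs an HF token).
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
print(tokenizer.tokenize("def add(a, b):\n    return a + b"))
```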
Experiment Setup: Yes. We summarize the model hyper-parameters in Table 5. ... Table 5: Model architecture and pre-training related hyper-parameters. Stage 1, Masked Language Modeling: Dropout 0.1; Max steps 250,000; Warmup steps 5,000; Batch size 2048; Base learning rate 3e-4. ... We present the hyper-parameters used while fine-tuning models for code classification tasks in Table 11. ... Table 11: Hyper-parameters for fine-tuning baseline models and CODESAGE on code classification tasks. Optimizer AdamW; Learning rate (LR) 1e-3; Batch size 32; Epochs 10.
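A small sketch collecting the quoted hyper-parameters into configuration objects; the class and field names are ours, while the values are taken verbatim from the Table 5 and Table 11 excerpts.

```python
from dataclasses import dataclass

@dataclass
class Stage1PretrainConfig:
    """Stage 1 (masked language modeling) settings quoted from Table 5."""
    dropout: float = 0.1
    max_steps: int = 250_000
    warmup_steps: int = 5_000
    batch_size: int = 2048
    base_learning_rate: float = 3e-4

@dataclass
class ClassificationFinetuneConfig:
    """Code-classification fine-tuning settings quoted from Table 11."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-3
    batch_size: int = 32
    num_epochs: int = 10

print(Stage1PretrainConfig())
print(ClassificationFinetuneConfig())
```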