CodeIPPrompt: Intellectual Property Infringement Assessment of Code Language Models

Authors: Zhiyuan Yu, Yuhao Wu, Ning Zhang, Chenguang Wang, Yevgeniy Vorobeychik, Chaowei Xiao

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted an extensive evaluation of existing open-source code LMs and commercial products, and revealed the prevalence of IP violations in all these models.
Researcher Affiliation | Academia | 1 Washington University in St. Louis, 2 Arizona State University, 3 University of Wisconsin-Madison. Correspondence to: Zhiyuan Yu <yu.zhiyuan@wustl.edu>, Chaowei Xiao <xiaocw@asu.edu>, Ning Zhang <zhang.ning@wustl.edu>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release our code and datasets to encourage further progress in improving IP protection and compliance. Project website: https://sites.google.com/view/codeipprompt/.
Open Datasets | Yes | To create a comprehensive dataset for evaluation, we compiled a collection of licensed code repositories from GitHub, totaling 4,075,553 across 34 different licenses. ... Specifically, we focused on The Pile, CodeParrot-Clean, CodeSearchNet, and GitHub Code (GCPY), which were widely used and adopted to train CodeGen, CodeParrot, and CodeRL models respectively. (A sketch of loading one of these public corpora follows the table.)
Dataset Splits | No | The paper describes evaluation metrics and human studies for validating similarity scores and selecting thresholds, but it does not provide specific train/validation/test splits for the models or for its own main evaluation dataset.
Hardware Specification | Yes | Graphics Card 1: NVIDIA A100 (80 GB VRAM); Graphics Card 2: NVIDIA GeForce RTX 3090 (24 GB VRAM); Central Processing Unit 1: Intel i9-10920X CPU (3.50 GHz); Central Processing Unit 2: AMD EPYC 7742 64-Core Processor (1.80 GHz).
Software Dependencies | No | The following experiments were carried out with the Hugging Face Transformers Library. ... Specifically, we built on Presidio (Microsoft, 2022) and spaCy (Vasiliev, 2020) to customize context-aware anonymization logic... (An illustrative Presidio + spaCy sketch follows the table.)
Experiment Setup | Yes | Table 9 (hyperparameters for fine-tuning models): CodeRL: optimizer AdamW, initial learning rate 2e-5, batch size 1, gradient accumulation steps 32, number of epochs 10, warmup steps 500, weight decay 0.05. CodeParrot: optimizer AdamW, initial learning rate 5e-4, batch size 2, gradient accumulation steps 32, number of epochs 54, warmup steps 100, weight decay 0.1. (A TrainingArguments sketch mirroring the CodeRL column follows the table.)
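
The Open Datasets row names several public training corpora. As a point of reference for obtaining one of them, below is a minimal sketch using the Hugging Face datasets library; the Hub identifier codeparrot/codeparrot-clean and the record fields are assumptions about the publicly hosted copy, not details taken from the paper.

# Minimal sketch: streaming a few records from one of the public training
# corpora named in the paper. The Hub ID and field names are assumptions
# about the public "codeparrot/codeparrot-clean" copy, not from the paper.
from datasets import load_dataset

ds = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)

for i, sample in enumerate(ds):
    # Each record is assumed to carry the source repository, license, and file content.
    print(sample["repo_name"], sample["license"], len(sample["content"]))
    if i >= 2:
        break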
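
The Software Dependencies row mentions Presidio and spaCy for context-aware anonymization but records no versions or code. As a rough illustration only, here is a minimal sketch of the stock Presidio analyzer/anonymizer pipeline (which runs a spaCy model under the hood); it is not the authors' customized logic.

# Minimal sketch of Presidio + spaCy anonymization. This illustrates the
# general API only, not the paper's context-aware anonymization logic.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # builds a spaCy-based NLP engine internally
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe at jane.doe@example.com for the repository key."

# Detect PII entities (names, email addresses, ...) in the text.
results = analyzer.analyze(text=text, language="en")

# Replace the detected spans with placeholder tokens such as <PERSON>.
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized.text)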
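
The fine-tuning hyperparameters in the Experiment Setup row map naturally onto Hugging Face TrainingArguments. The sketch below mirrors the CodeRL column under the assumption that the standard Trainer API was used; the output directory is hypothetical and the paper's actual training script is not reproduced here.

# Minimal sketch: the CodeRL fine-tuning hyperparameters from Table 9 expressed
# as Hugging Face TrainingArguments. Assumes the standard Trainer API.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="coderl-finetune",      # hypothetical output directory
    optim="adamw_torch",               # AdamW optimizer
    learning_rate=2e-5,                # initial learning rate
    per_device_train_batch_size=1,     # batch size
    gradient_accumulation_steps=32,
    num_train_epochs=10,
    warmup_steps=500,
    weight_decay=0.05,
)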