CodeIPPrompt: Intellectual Property Infringement Assessment of Code Language Models
Authors: Zhiyuan Yu, Yuhao Wu, Ning Zhang, Chenguang Wang, Yevgeniy Vorobeychik, Chaowei Xiao
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted an extensive evaluation of existing open-source code LMs and commercial products, and revealed the prevalence of IP violations in all these models. |
| Researcher Affiliation | Academia | 1 Washington University in St. Louis, 2 Arizona State University, 3 University of Wisconsin-Madison. Correspondence to: Zhiyuan Yu <yu.zhiyuan@wustl.edu>, Chaowei Xiao <xiaocw@asu.edu>, Ning Zhang <zhang.ning@wustl.edu>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code and datasets to encourage further progress in improving IP protection and compliance. Project website: https://sites.google.com/view/codeipprompt/. |
| Open Datasets | Yes | To create a comprehensive dataset for evaluation, we compiled a collection of licensed code repositories from GitHub, totaling 4,075,553 across 34 different licenses. ... Specifically, we focused on The Pile, CodeParrot-Clean, CodeSearchNet, and GitHub-Code (GCPY), which were widely used and adopted to train CodeGen, CodeParrot, and CodeRL models respectively. (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper describes evaluation metrics and human studies for validating similarity scores and selecting thresholds, but it does not provide specific train/validation/test splits for models or their own main evaluation dataset. |
| Hardware Specification | Yes | Graphics Card 1: NVIDIA A100 (80GB VRAM); Graphics Card 2: NVIDIA GeForce RTX 3090 (24GB VRAM); CPU 1: Intel i9-10920X (3.50GHz); CPU 2: AMD EPYC 7742 64-Core Processor (1.80GHz) |
| Software Dependencies | No | The following experiments were carried out with the Hugging Face Transformers Library. ... Specifically, we built on Presidio (Microsoft, 2022) and spaCy (Vasiliev, 2020) to customize context-aware anonymization logic... (A minimal Presidio/spaCy sketch follows the table.) |
| Experiment Setup | Yes | Table 9 (fine-tuning hyperparameters): CodeRL: optimizer AdamW, initial learning rate 2e-5, batch size 1, gradient accumulation steps 32, epochs 10, warmup steps 500, weight decay 0.05. CodeParrot: optimizer AdamW, initial learning rate 5e-4, batch size 2, gradient accumulation steps 32, epochs 54, warmup steps 100, weight decay 0.1. (These values are restated as a TrainingArguments sketch after the table.) |
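The pretraining corpora named in the Open Datasets row (The Pile, CodeParrot-Clean, CodeSearchNet, GitHub-Code) are all distributed through the Hugging Face Hub. Below is a minimal loading sketch, assuming the hub IDs `codeparrot/codeparrot-clean` and `code_search_net` and their usual record fields (`content`, `func_code_string`); these identifiers come from the public hub, not from the paper itself.

```python
# Hedged sketch: stream two of the cited pretraining corpora with the
# Hugging Face `datasets` library. Hub IDs and field names are assumptions
# based on the public hub; The Pile and GitHub-Code (GCPY) load analogously.
from datasets import load_dataset

# CodeParrot-Clean: deduplicated Python files scraped from GitHub.
codeparrot_clean = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)

# CodeSearchNet (Python subset): functions paired with their docstrings.
code_search_net = load_dataset("code_search_net", "python", split="train", streaming=True)

# Peek at one record from each corpus without downloading the full dataset.
print(next(iter(codeparrot_clean))["content"][:200])
print(next(iter(code_search_net))["func_code_string"][:200])
```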
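The Software Dependencies row mentions anonymization built on Presidio and spaCy. The sketch below shows the stock Presidio analyzer/anonymizer pipeline, which uses a spaCy model internally; it illustrates the libraries' standard API rather than the paper's customized context-aware logic, and the input string is invented.

```python
# Hedged sketch of a Presidio-based anonymization pass using the default
# spaCy-backed NLP engine; the paper's context-aware customizations are not
# reproduced here, and the input string is purely illustrative.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()        # loads a spaCy model under the hood
anonymizer = AnonymizerEngine()

code_comment = "# Contact John Doe at john.doe@example.com for the API key"

# Detect PII entities (person names, email addresses, ...) in the text.
results = analyzer.analyze(text=code_comment, language="en")

# Replace each detected span with a placeholder such as <PERSON> or <EMAIL_ADDRESS>.
anonymized = anonymizer.anonymize(text=code_comment, analyzer_results=results)
print(anonymized.text)
```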
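The Table 9 hyperparameters in the Experiment Setup row map directly onto Hugging Face `TrainingArguments`, which the Transformers Trainer consumes. A minimal sketch for the CodeRL column follows; the `output_dir` is a placeholder, and the paper reports only the values, not this exact configuration object.

```python
# Hedged sketch: the CodeRL column of Table 9 expressed as Hugging Face
# TrainingArguments. output_dir is a placeholder; AdamW is the Trainer's
# default optimizer, matching the reported choice.
from transformers import TrainingArguments

coderl_args = TrainingArguments(
    output_dir="coderl-finetune",       # placeholder path, not from the paper
    learning_rate=2e-5,                 # initial learning rate
    per_device_train_batch_size=1,      # batch size 1
    gradient_accumulation_steps=32,
    num_train_epochs=10,
    warmup_steps=500,
    weight_decay=0.05,
)
# The CodeParrot column differs only in: learning_rate=5e-4, batch size 2,
# epochs 54, warmup_steps 100, weight_decay 0.1.
```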