Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases
Authors: Zian Su, Xiangzhe Xu, Ziyang Huang, Kaiyuan Zhang, Xiangyu Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ProRec on a diversified dataset compiled from GitHub repositories, demonstrating improvements of 3.1% (10.3% relative gain) in CHRF and 12% (16.7% relative gain) in a GPT4-based metric that has high correlation with human judgement on the summarization task over zero-shot baseline. |
| Researcher Affiliation | Academia | Zian Su¹, Xiangzhe Xu¹, Ziyang Huang², Kaiyuan Zhang¹, Xiangyu Zhang¹ (¹Purdue University, ²Johns Hopkins University) |
| Pseudocode | No | The paper contains diagrams and code snippets in figures, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code and data are available at https://github.com/ziansu/prorec. |
| Open Datasets | Yes | Our code and data are available at https://github.com/ziansu/prorec. In total, our data consists of 270k pairs of binary and source code functions. |
| Dataset Splits | Yes | We split 260k data samples for training and 10k data samples for test. We use 5% of the training data as the validation dataset. *(See the split sketch below the table.)* |
| Hardware Specification | Yes | Our training is conducted using 4 NVIDIA A100s. |
| Software Dependencies | Yes | We choose the Code-Llama [55] family as our base SCFM. ... The versions of the black-box LLM recoverers are gpt-3.5-turbo-1106 for GPT3.5-turbo, claude-3-haiku-20240307 for Claude-3, gemini-1.0-pro for Gemini-Pro, and gpt-4-turbo-2024-04-09 for GPT4 Evaluator. *(See the version-pinning sketch below the table.)* |
| Experiment Setup | Yes | We train the model with learning rate 5e-5, a batch size of 16, 1k warmup steps, and 17k total steps. For memory efficiency, we apply quantization (4-bit or 8-bit) [17, 18] to the base SCFM during alignment. *(See the training-config sketch below the table.)* |
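The dataset-splits row reports 260k training samples, 10k test samples, and a 5% validation hold-out carved from the training data. Below is a minimal sketch of that split; the function name, the fixed seed, and the use of a flat in-memory list are illustrative assumptions, not the authors' pipeline.

```python
import random

def split_dataset(pairs, seed=42):
    """Split the 270k binary/source function pairs into the reported
    sizes: 10k test, and the remaining 260k with 5% held out for
    validation. Sizes are taken from the table above; everything
    else is an assumption."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    test = pairs[:10_000]                 # 10k test samples
    train_full = pairs[10_000:]           # remaining 260k samples
    n_val = int(0.05 * len(train_full))   # 5% of training data as validation
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test
```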
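The software-dependencies row pins exact version strings for the black-box LLM recoverers, which is what makes the API-side results reproducible. A short sketch of passing one such pinned string to a provider client, assuming the OpenAI Python SDK (v1+); the prompt text is a placeholder, not the authors' prompt.

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()

# "gpt-3.5-turbo-1106" is the exact GPT3.5-turbo version reported in the
# paper; the message content below is illustrative only.
resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[{"role": "user", "content": "Summarize this decompiled function: ..."}],
)
print(resp.choices[0].message.content)
```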
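The experiment-setup row gives the full optimization recipe: learning rate 5e-5, batch size 16, 1k warmup steps, 17k total steps, and 4-bit or 8-bit quantization of the base SCFM. The sketch below expresses that configuration assuming the Hugging Face transformers + bitsandbytes stack and a CodeLlama-7B checkpoint; the per-device batch split across the 4 A100s is an assumption, and the repository linked above is authoritative.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

MODEL_NAME = "codellama/CodeLlama-7b-hf"  # assumed CodeLlama checkpoint

# 4-bit quantization of the base SCFM for memory efficiency, as reported.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)

# Hyperparameters exactly as reported: lr 5e-5, 1k warmup, 17k total steps.
# Batch size 16 is realized here as 4 per device x 4 A100s (assumed split).
args = TrainingArguments(
    output_dir="prorec-alignment",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    warmup_steps=1_000,
    max_steps=17_000,
    bf16=True,
    logging_steps=100,
)
```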