CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching
Authors: Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, Shi Wu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model on two datasets, where it outperforms other methods significantly. |
| Researcher Affiliation | Collaboration | (1) Tencent Security Keen Lab, Shanghai, China; (2) Shanghai Jiao Tong University, Shanghai, China |
| Pseudocode | No | Not found. The paper describes methods in prose and diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | More information about the dataset could be found at https://github.com/binaryai. (This link is specifically for the dataset, not the source code for the methodology.) |
| Open Datasets | Yes | More information about the dataset could be found at https://github.com/binaryai. |
| Dataset Splits | Yes | Each dataset contains 30,000 source-binary pairs for training, 10,000 pairs for validation, and 10,000 pairs for testing. |
| Hardware Specification | No | Not found. The paper does not specify the hardware used for running its experiments. |
| Software Dependencies | No | For binary code, the IDA Pro tool is used to extract the tokens and features. (No version specified for IDA Pro or other software.) |
| Experiment Setup | Yes | For the training process, the training epoch is set to 64 for gcc-x64-O0 and 128 for clang-arm-O3. The learning rate is 0.001, the batch size is 32, the triplet margin is 0.5, and the optimizer is Adam. For source code, the length of character-level sequences is 4,096; the dimension is 64 on embedding layer and 128 on convolutional layers. The repeat number of residual blocks is 7. For binary code, the dimension of node embedding and graph embedding is both 128; the number of GGNN message passing iteration and Set2Set iteration are both 5. For strings and integers, the embedding layer's dimension and LSTM's hidden dimension are both 64. |
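
The reported setup and splits can be collected into a single configuration object, which makes the numbers easier to check against a reimplementation. The sketch below is illustrative only: the class name `ExperimentConfig`, all field names, and the helper functions are hypothetical and do not come from the paper or any released code. The triplet loss uses Euclidean distance as an assumption, since the table states only the margin (0.5) and not the distance function.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as reported in the table; field names are hypothetical."""
    # Training
    epochs_gcc_x64_o0: int = 64
    epochs_clang_arm_o3: int = 128
    learning_rate: float = 0.001
    batch_size: int = 32
    triplet_margin: float = 0.5
    optimizer: str = "Adam"
    # Source-code branch (character-level sequence model)
    src_seq_len: int = 4096
    src_embed_dim: int = 64
    src_conv_dim: int = 128
    src_residual_blocks: int = 7
    # Binary-code branch (GGNN + Set2Set)
    bin_node_embed_dim: int = 128
    bin_graph_embed_dim: int = 128
    ggnn_iterations: int = 5
    set2set_iterations: int = 5
    # String and integer features
    str_int_embed_dim: int = 64
    str_int_lstm_hidden_dim: int = 64
    # Dataset splits (source-binary pairs per dataset)
    train_pairs: int = 30_000
    val_pairs: int = 10_000
    test_pairs: int = 10_000


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def steps_per_epoch(cfg):
    """Number of mini-batches per epoch, keeping the final partial batch."""
    return math.ceil(cfg.train_pairs / cfg.batch_size)


CFG = ExperimentConfig()
```

With these values, `steps_per_epoch(CFG)` gives the number of optimizer updates per epoch under the assumption that the last partial batch is kept, and `triplet_loss(..., CFG.triplet_margin)` is zero whenever the negative is at least 0.5 farther from the anchor than the positive.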