CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching

Authors: Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, Shi Wu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our model on two datasets, where it outperforms other methods significantly."
Researcher Affiliation | Collaboration | (1) Tencent Security Keen Lab, Shanghai, China; (2) Shanghai Jiao Tong University, Shanghai, China
Pseudocode | No | Not found. The paper describes its methods in prose and diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | "More information about the dataset could be found at https://github.com/binaryai." (This link is specifically for the dataset, not the source code for the methodology.)
Open Datasets | Yes | "More information about the dataset could be found at https://github.com/binaryai."
Dataset Splits | Yes | Each dataset contains 30,000 source-binary pairs for training, 10,000 pairs for validation, and 10,000 pairs for testing.
Hardware Specification | No | Not found. The paper does not specify the hardware used to run its experiments.
Software Dependencies | No | "For binary code, the IDA Pro tool is used to extract the tokens and features." (No version is specified for IDA Pro or any other software.)
Experiment Setup | Yes | For the training process, the number of epochs is 64 for gcc-x64-O0 and 128 for clang-arm-O3. The learning rate is 0.001, the batch size is 32, the triplet margin is 0.5, and the optimizer is Adam. For source code, the length of the character-level sequence is 4,096; the dimension is 64 on the embedding layer and 128 on the convolutional layers, and the number of repeated residual blocks is 7. For binary code, the dimensions of the node embedding and the graph embedding are both 128, and the numbers of GGNN message-passing iterations and Set2Set iterations are both 5. For strings and integers, the embedding layer's dimension and the LSTM's hidden dimension are both 64.
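The reported setup values map onto a standard PyTorch training configuration. The sketch below is a minimal, hypothetical illustration (not the authors' released code): the config dictionary collects the hyperparameters and dataset split sizes quoted above, while the placeholder encoders, the Euclidean distance inside the triplet loss, and the random stand-in batches are assumptions made only to keep the example runnable.

```python
import torch
import torch.nn as nn

# Hyperparameters and split sizes as quoted in the report above.
config = {
    "epochs_gcc_x64_O0": 64,
    "epochs_clang_arm_O3": 128,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "triplet_margin": 0.5,
    "source_char_seq_len": 4096,
    "source_embedding_dim": 64,
    "source_conv_dim": 128,
    "residual_blocks": 7,
    "binary_node_dim": 128,
    "binary_graph_dim": 128,
    "ggnn_iterations": 5,
    "set2set_iterations": 5,
    "string_int_embedding_dim": 64,
    "string_int_lstm_hidden": 64,
    "train_pairs": 30_000,
    "valid_pairs": 10_000,
    "test_pairs": 10_000,
}

# Placeholder encoders standing in for the paper's source-code model (DPCNN-style,
# character-level) and binary-code model (GGNN + Set2Set); both map inputs to the
# 128-dimensional embedding space quoted above.
source_encoder = nn.Linear(config["source_embedding_dim"], config["binary_graph_dim"])
binary_encoder = nn.Linear(config["binary_node_dim"], config["binary_graph_dim"])

# Triplet loss over (source, matching binary, non-matching binary) embeddings.
# Only the margin value comes from the paper; the distance metric is an assumption.
triplet_loss = nn.TripletMarginLoss(margin=config["triplet_margin"], p=2)

optimizer = torch.optim.Adam(
    list(source_encoder.parameters()) + list(binary_encoder.parameters()),
    lr=config["learning_rate"],
)

# One example step with random stand-in batches (batch size 32, as quoted above).
src = source_encoder(torch.randn(config["batch_size"], config["source_embedding_dim"]))
pos = binary_encoder(torch.randn(config["batch_size"], config["binary_node_dim"]))
neg = binary_encoder(torch.randn(config["batch_size"], config["binary_node_dim"]))

optimizer.zero_grad()
loss = triplet_loss(src, pos, neg)
loss.backward()
optimizer.step()
```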