CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching

Authors: Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, Shi Wu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our model on two datasets, where it outperforms other methods significantly."
Researcher Affiliation | Collaboration | (1) Tencent Security Keen Lab, Shanghai, China; (2) Shanghai Jiao Tong University, Shanghai, China
Pseudocode | No | Not found. The paper describes its methods in prose and diagrams but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | "More information about the dataset could be found at https://github.com/binaryai." (This link is specifically for the dataset, not the source code for the methodology.)
Open Datasets | Yes | "More information about the dataset could be found at https://github.com/binaryai."
Dataset Splits | Yes | Each dataset contains 30,000 source-binary pairs for training, 10,000 pairs for validation, and 10,000 pairs for testing.
Hardware Specification | No | Not found. The paper does not specify the hardware used to run its experiments.
Software Dependencies | No | "For binary code, the IDA Pro tool is used to extract the tokens and features." (No version is specified for IDA Pro or any other software.)
Experiment Setup | Yes | For the training process, the number of epochs is 64 for gcc-x64-O0 and 128 for clang-arm-O3. The learning rate is 0.001, the batch size is 32, the triplet margin is 0.5, and the optimizer is Adam. For source code, the length of the character-level sequence is 4,096; the dimension is 64 on the embedding layer and 128 on the convolutional layers, and the number of repeated residual blocks is 7. For binary code, the dimensions of the node embedding and the graph embedding are both 128, and the numbers of GGNN message-passing iterations and Set2Set iterations are both 5. For strings and integers, the embedding layer's dimension and the LSTM's hidden dimension are both 64.
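The reported setup values map onto a standard PyTorch training configuration. The sketch below is a minimal, hypothetical illustration (not the authors' released code): the config dictionary collects the hyperparameters and dataset split sizes quoted above, while the placeholder encoders, the Euclidean distance inside the triplet loss, and the random stand-in batches are assumptions made only to keep the example runnable.

```python
import torch
import torch.nn as nn

# Hyperparameters and split sizes as quoted in the report above.
config = {
    "epochs_gcc_x64_O0": 64,
    "epochs_clang_arm_O3": 128,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "triplet_margin": 0.5,
    "source_char_seq_len": 4096,
    "source_embedding_dim": 64,
    "source_conv_dim": 128,
    "residual_blocks": 7,
    "binary_node_dim": 128,
    "binary_graph_dim": 128,
    "ggnn_iterations": 5,
    "set2set_iterations": 5,
    "string_int_embedding_dim": 64,
    "string_int_lstm_hidden": 64,
    "train_pairs": 30_000,
    "valid_pairs": 10_000,
    "test_pairs": 10_000,
}

# Placeholder encoders standing in for the paper's source-code model (DPCNN-style,
# character-level) and binary-code model (GGNN + Set2Set); both map inputs to the
# 128-dimensional embedding space quoted above.
source_encoder = nn.Linear(config["source_embedding_dim"], config["binary_graph_dim"])
binary_encoder = nn.Linear(config["binary_node_dim"], config["binary_graph_dim"])

# Triplet loss over (source, matching binary, non-matching binary) embeddings.
# Only the margin value comes from the paper; the distance metric is an assumption.
triplet_loss = nn.TripletMarginLoss(margin=config["triplet_margin"], p=2)

optimizer = torch.optim.Adam(
    list(source_encoder.parameters()) + list(binary_encoder.parameters()),
    lr=config["learning_rate"],
)

# One example step with random stand-in batches (batch size 32, as quoted above).
src = source_encoder(torch.randn(config["batch_size"], config["source_embedding_dim"]))
pos = binary_encoder(torch.randn(config["batch_size"], config["binary_node_dim"]))
neg = binary_encoder(torch.randn(config["batch_size"], config["binary_node_dim"]))

optimizer.zero_grad()
loss = triplet_loss(src, pos, neg)
loss.backward()
optimizer.step()
```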