Coda: An End-to-End Neural Program Decompiler
Authors: Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuandong Tian, Farinaz Koushanfar, Jishen Zhao
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assess Coda's performance with extensive experiments on various benchmarks. Evaluation results show that Coda achieves an average of 82% program recovery accuracy on unseen binary samples, where the state-of-the-art decompilers yield 0% accuracy. Furthermore, Coda outperforms the sequence-to-sequence model with attention by a margin of 70% program accuracy. |
| Researcher Affiliation | Collaboration | Cheng Fu, Huili Chen, Haolan Liu (UC San Diego) {cfu,huc044,hal022}@ucsd.edu; Xinyun Chen (UC Berkeley) xinyun.chen@berkeley.edu; Yuandong Tian (Facebook) yuandong@fb.com; Farinaz Koushanfar, Jishen Zhao (UC San Diego) {farinaz,jzhao}@ucsd.edu |
| Pseudocode | Yes | Algorithm 1: Workflow of iterative EC Machine (a sketch of this loop appears below the table). |
| Open Source Code | No | The paper mentions using open-source disassemblers (mipt-mips, REDasm) but does not state that the code for Coda itself is open-source or provide a link. |
| Open Datasets | No | To build the training dataset for stage 1, we randomly generate 50,000 pairs of high-level programs with the corresponding assembly code for each task. The training dataset for the error correction stage is constructed by injecting various types of errors into the high-level code. The paper generated its own dataset and does not provide public access information. |
| Dataset Splits | No | The paper does not provide explicit details about a validation dataset split (e.g., percentages or counts). |
| Hardware Specification | No | The paper mentions "limited GPU memory" as a challenge for long programs but does not specify any particular GPU model, CPU, or other hardware used for the experiments. |
| Software Dependencies | No | The paper mentions using `clang` for compilation and `mipt-mips` and `REDasm` for disassembling, but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We set Smax = 30 and cmax = 10 for the EC machine in Algorithm 1. In our experiments, we inject 10-20% token errors whose locations are sampled from a uniform random distribution. To address the class imbalance problem during EP training, we mask 35% of the tokens with error status 0 (i.e., no error occurs) when computing the loss. The program is compiled using clang with configuration -O0, which disables all optimizations. (Hedged sketches of the EC loop, the error injection, and the compile command follow below.) |
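
For reference, here is a minimal Python sketch of the iterative error-correction loop that Algorithm 1 describes, using the Smax = 30 and cmax = 10 values reported in the experiment setup. The callables `error_predictor`, `propose_corrections`, and `compile_and_diff` are hypothetical placeholders, not the authors' published API.

```python
# A minimal sketch of the iterative error-correction (EC) loop in Algorithm 1,
# assuming Smax = 30 and cmax = 10 from the paper's experiment setup.
# All function names below are hypothetical, not the authors' API.

S_MAX = 30  # maximum EC iterations (paper: Smax = 30)
C_MAX = 10  # candidate corrections considered per iteration (paper: cmax = 10)

def ec_machine(golden_asm, program, error_predictor,
               propose_corrections, compile_and_diff):
    """Iteratively edit `program` until its compiled assembly matches `golden_asm`."""
    for _ in range(S_MAX):
        diff = compile_and_diff(program, golden_asm)  # recompile and compare
        if not diff:  # empty diff: recovered program compiles to the target
            return program
        # The error predictor (EP) flags token positions that likely differ
        # from the ground-truth program.
        positions = error_predictor(program, golden_asm)
        # Enumerate up to C_MAX candidate edits and greedily keep the one
        # whose compiled output is closest to the golden assembly.
        candidates = propose_corrections(program, positions)[:C_MAX]
        if not candidates:
            break
        program = min(candidates,
                      key=lambda p: len(compile_and_diff(p, golden_asm)))
    return program  # best effort after S_MAX iterations
```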
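Similarly, the error-injection and loss-masking procedure quoted in the setup row (10-20% token errors at uniformly sampled positions; 35% of no-error tokens masked when computing the EP loss) could look roughly like the sketch below. All names are illustrative.

```python
import random

ERROR_RATE = (0.10, 0.20)  # paper: inject 10-20% token errors
MASK_FRAC = 0.35           # paper: mask 35% of tokens with error status 0

def inject_token_errors(tokens, vocab, rng=random):
    """Corrupt a uniformly sampled subset of tokens; return (noisy, labels)."""
    n_err = max(1, int(len(tokens) * rng.uniform(*ERROR_RATE)))
    noisy, labels = list(tokens), [0] * len(tokens)
    for i in rng.sample(range(len(tokens)), n_err):
        noisy[i] = rng.choice([t for t in vocab if t != tokens[i]])
        labels[i] = 1  # error status 1: token was corrupted
    return noisy, labels

def ep_loss_mask(labels, rng=random):
    """Exclude 35% of the no-error positions from the EP loss (rebalancing)."""
    return [1 if y == 1 else int(rng.random() >= MASK_FRAC) for y in labels]
```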
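Finally, the compilation setting is reproducible with a standard clang invocation. The MIPS target triple below is our assumption, inferred from the MIPS disassembler (mipt-mips) the paper mentions; the paper itself gives only the -O0 configuration, not the full command.

```python
import subprocess

# Compile a C source to assembly with all optimizations disabled (-O0), as in
# the paper's setup. The --target=mips triple is an assumption on our part.
subprocess.run(
    ["clang", "--target=mips", "-O0", "-S", "prog.c", "-o", "prog.s"],
    check=True,
)
```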