A New Burrows Wheeler Transform Markov Distance
Authors: Edward Raff, Charles Nicholas, Mark McLean
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We will then move into empirical results in Section 5 by comparing BWMD with EBWT on DNA sequence clustering, where we show that BWMD is able to cluster DNA sequences of varying lengths that EBWT fails to cluster in a meaningful way. In Section 6 we will show how BWMD is able to scale to malware classification and clustering tasks that are beyond EBWT's computational ability. |
| Researcher Affiliation | Collaboration | Edward Raff (1,2,3), Charles Nicholas (3), Mark McLean (1); 1: Laboratory for Physical Sciences, 2: Booz Allen Hamilton, 3: University of Maryland, Baltimore County; raff_edward@bah.com, nicholas@umbc.edu, mclean@lps.umd.edu |
| Pseudocode | No | The description of BWMD in Section 3 provides a numbered list of steps for its implementation, but these steps are not formally labeled as 'Pseudocode' or 'Algorithm' in a dedicated block. |
| Open Source Code | No | The paper mentions that 'A Java (Raff and Nicholas 2018b) and Python (Raff, Aurelio, and Nicholas 2019) implementations of LZJD are available.' This refers to the code for LZJD, a baseline method, not the novel BWMD method described in this paper. No explicit statement or link is provided for the BWMD code. |
| Open Datasets | Yes | Such data can be obtained using the NIH GenBank, which we have used to create a similar corpus of DNA sequences to compare the relative pros and cons of BWMD and EBWT. The EMBER dataset (Anderson and Roth 2018) pertains to a binary classification problem of benign vs malicious for Windows executables. ... The raw files can be obtained from VirusTotal (www.virustotal.com) and are nearly 1TB total. The Kaggle datasets are from a 2015 Kaggle competition sponsored by Microsoft (Ronen et al. 2018). From the Drebin corpus (Arp et al. 2014) we use the 20 most populous families... |
| Dataset Splits | Yes | For our evaluation, we will use several datasets summarized in Table 3. Using VirusShare (Roberts 2011) we create another Windows-based dataset with 20 malware families. We select the 20 most populous families, and use 7,000 examples for training and 3,000 for testing. Table 4: Balanced Accuracy results for 1-NN classification on each dataset. Results show mean 10-Fold Cross Validation accuracy (standard deviation in parentheses). (A cross-validation sketch reflecting this protocol follows the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU models, or memory specifications. It only mentions 'CPU hours' in the context of runtime performance without detailing the hardware. |
| Software Dependencies | No | The paper mentions general software like 'Java' and 'Python' for LZJD implementations, and 'k-Means algorithm' using 'Hamerly’s variant' or 'Average-Link clustering' but does not provide specific version numbers for any software, libraries, or dependencies used in their experiments. |
| Experiment Setup | Yes | For our evaluation, we will use several datasets summarized in Table 3. On EMBER we use 9-Nearest Neighbors as our classifier... BWMD is the only method that can leverage the k-Means algorithm, and we use Hamerly's variant because it avoids redundant computation while returning the exact same results... For LZJD and BitShred we use Average-Link clustering using a fast O(n²) algorithm. Evaluating the quality of our clustering results, we will consider three measures: Homogeneity, Completeness, and V-Measure... In performing the clustering, we will test using k = the true number of classes and k = 10× the true number of classes. (A clustering-evaluation sketch based on this description follows the table.) |
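
The Dataset Splits row quotes a 1-NN classification protocol scored by mean 10-fold cross-validation balanced accuracy. The paper's BWMD code is not released, so the following is only a minimal sketch of that evaluation loop, assuming samples have already been embedded as fixed-length feature vectors; `X`, `y`, and the random placeholder data are hypothetical stand-ins, and scikit-learn is used purely for illustration.

```python
# Minimal sketch of 1-NN, 10-fold CV scored by balanced accuracy.
# X and y are placeholder arrays standing in for BWMD-style embeddings
# and malware-family labels; they are NOT the paper's data or method.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))      # placeholder embeddings
y = rng.integers(0, 5, size=200)    # placeholder class labels

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    # n_neighbors=1 for the 1-NN results; the quoted setup uses 9-NN on EMBER.
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(balanced_accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean balanced accuracy: {np.mean(scores):.3f} (+/- {np.std(scores):.3f})")
```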
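The Experiment Setup row names k-Means clustering (Hamerly's variant) evaluated by Homogeneity, Completeness, and V-Measure, with k set to the true number of classes and to 10× that number. The sketch below mirrors that scoring procedure using scikit-learn's standard k-Means rather than the Hamerly variant cited in the paper (Hamerly's variant returns the same assignments, just faster); the embeddings and labels are again hypothetical placeholders.

```python
# Hedged sketch of the clustering evaluation: cluster placeholder
# embeddings with k-Means and score Homogeneity, Completeness, and
# V-Measure against the true labels, for k = #classes and k = 10 * #classes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_completeness_v_measure

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 64))          # placeholder BWMD-style embeddings
y_true = rng.integers(0, 20, size=300)  # placeholder family labels (20 families)

n_classes = len(np.unique(y_true))
for k in (n_classes, 10 * n_classes):
    y_pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
    print(f"k={k}: homogeneity={h:.3f} completeness={c:.3f} v-measure={v:.3f}")
```

For the Average-Link baseline clustering mentioned for LZJD and BitShred, scikit-learn's `AgglomerativeClustering` with `linkage='average'` over a precomputed distance matrix would be the analogous off-the-shelf choice, though the paper describes a faster O(n²) implementation.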