A New Burrows Wheeler Transform Markov Distance
Authors: Edward Raff, Charles Nicholas, Mark McLean
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We will then move into empirical results in Section 5 by comparing BWMD with EBWT on DNA sequence clustering, where we show that BWMD is able to cluster DNA sequences of varying lengths that EBWT fails to cluster in a meaningful way. In Section 6 we will show how BWMD is able to scale to malware classification and clustering tasks that are beyond EBWT's computational ability. |
| Researcher Affiliation | Collaboration | Edward Raff (1,2,3), Charles Nicholas (3), Mark McLean (1); 1: Laboratory for Physical Sciences, 2: Booz Allen Hamilton, 3: University of Maryland, Baltimore County; raff_edward@bah.com, nicholas@umbc.edu, mclean@lps.umd.edu |
| Pseudocode | No | The description of BWMD in Section 3 provides a numbered list of steps for its implementation, but these steps are not formally labeled as 'Pseudocode' or 'Algorithm' in a dedicated block. |
| Open Source Code | No | The paper mentions that 'A Java (Raff and Nicholas 2018b) and Python (Raff, Aurelio, and Nicholas 2019) implementations of LZJD are available.' This refers to the code for LZJD, a baseline method, not the novel BWMD method described in this paper. No explicit statement or link is provided for the BWMD code. |
| Open Datasets | Yes | Such data can be obtained using the NIH GenBank, which we have used to create a similar corpus of DNA sequences to compare the relative pros and cons of BWMD and EBWT. The EMBER dataset (Anderson and Roth 2018) pertains to a binary classification problem of benign vs malicious for Windows executables. ... The raw files can be obtained from VirusTotal (www.virustotal.com) and are nearly 1TB total. The Kaggle datasets are from a 2015 Kaggle competition sponsored by Microsoft (Ronen et al. 2018). From the Drebin corpus (Arp et al. 2014) we use the 20 most populous families... |
| Dataset Splits | Yes | For our evaluation, we will use several datasets summarized in Table 3. Using VirusShare (Roberts 2011) we create another Windows-based dataset with 20 malware families. We select the 20 most populous families, and use 7,000 examples for training and 3,000 for testing. Table 4: Balanced Accuracy results for 1-NN classification on each dataset. Results show mean 10-Fold Cross Validation accuracy (standard deviation in parentheses). (A cross-validation sketch reflecting this protocol follows the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU models, or memory specifications. It only mentions 'CPU hours' in the context of runtime performance without detailing the hardware. |
| Software Dependencies | No | The paper mentions general software like 'Java' and 'Python' for LZJD implementations, and 'k-Means algorithm' using 'Hamerly’s variant' or 'Average-Link clustering' but does not provide specific version numbers for any software, libraries, or dependencies used in their experiments. |
| Experiment Setup | Yes | For our evaluation, we will use several datasets summarized in Table 3. On EMBER we use 9-Nearest Neighbors as our classifier... BWMD is the only method that can leverage the k-Means algorithm, and we use Hamerly's variant because it avoids redundant computation while returning the exact same results... For LZJD and BitShred we use Average-Link clustering using a fast O(n²) algorithm. Evaluating the quality of our clustering results, we will consider three measures: Homogeneity, Completeness, and V-Measure... In performing the clustering, we will test using k = the true number of classes and k = 10× the true number of classes. (A clustering-evaluation sketch based on this description follows the table.) |
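
The Dataset Splits row quotes a 1-NN classification protocol scored by mean 10-fold cross-validation balanced accuracy. The paper's BWMD code is not released, so the following is only a minimal sketch of that evaluation loop, assuming samples have already been embedded as fixed-length feature vectors; `X`, `y`, and the random placeholder data are hypothetical stand-ins, and scikit-learn is used purely for illustration.

```python
# Minimal sketch of 1-NN, 10-fold CV scored by balanced accuracy.
# X and y are placeholder arrays standing in for BWMD-style embeddings
# and malware-family labels; they are NOT the paper's data or method.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))      # placeholder embeddings
y = rng.integers(0, 5, size=200)    # placeholder class labels

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    # n_neighbors=1 for the 1-NN results; the quoted setup uses 9-NN on EMBER.
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(balanced_accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean balanced accuracy: {np.mean(scores):.3f} (+/- {np.std(scores):.3f})")
```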
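The Experiment Setup row names k-Means clustering (Hamerly's variant) evaluated by Homogeneity, Completeness, and V-Measure, with k set to the true number of classes and to 10× that number. The sketch below mirrors that scoring procedure using scikit-learn's standard k-Means rather than the Hamerly variant cited in the paper (Hamerly's variant returns the same assignments, just faster); the embeddings and labels are again hypothetical placeholders.

```python
# Hedged sketch of the clustering evaluation: cluster placeholder
# embeddings with k-Means and score Homogeneity, Completeness, and
# V-Measure against the true labels, for k = #classes and k = 10 * #classes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_completeness_v_measure

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 64))          # placeholder BWMD-style embeddings
y_true = rng.integers(0, 20, size=300)  # placeholder family labels (20 families)

n_classes = len(np.unique(y_true))
for k in (n_classes, 10 * n_classes):
    y_pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
    print(f"k={k}: homogeneity={h:.3f} completeness={c:.3f} v-measure={v:.3f}")
```

For the Average-Link baseline clustering mentioned for LZJD and BitShred, scikit-learn's `AgglomerativeClustering` with `linkage='average'` over a precomputed distance matrix would be the analogous off-the-shelf choice, though the paper describes a faster O(n²) implementation.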