reproducibilityindex.ai

Is Feature Selection Secure against Training Data Poisoning?

Authors: Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, Fabio Roli

ICML 2015

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results on malware detection show that feature selection methods can be significantly compromised under attack... We report experiments on PDF malware detection... In our experiments, we exploit the feature representation proposed by Maiorca et al. (2012)... Experimental results. Results are reported in Fig. 2...
Researcher Affiliation	Academia	Department of Computer Science, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany; Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, 09123 Cagliari, Italy; School of Computer Science, University of Manchester, Oxford Road, M13 9PL, UK
Pseudocode	Yes	Algorithm 1 Poisoning Embedded Feature Selection
Open Source Code	No	No explicit statement or link providing concrete access to source code for the methodology described in this paper was found.
Open Datasets	No	We collected 5993 recent malware samples from the Contagio dataset,[4] and 5951 benign samples from the web. (Footnote 4: http://contagiodump.blogspot.it). No formal citation with authors/year or a recognized dataset repository link was provided for public access to the specific dataset.
Dataset Splits	Yes	We then randomly sampled five pairs of training and test sets from the remaining data, respectively consisting of 300 and 5,000 samples, to average the final results. To simulate LK attacks (Sect. 2.2), we also sampled an additional set of five training sets (to serve as D̂) consisting of 300 samples each.
Hardware Specification	No	No specific hardware details (such as GPU/CPU models, memory, or cloud instance types) used for running the experiments were provided.
Software Dependencies	No	No specific version numbers for software dependencies were provided. The paper references 'Scikit-learn: Machine learning in Python' (Pedregosa et al., 2011) but does not state its version or other software versions used in the experiments.
Experiment Setup	Yes	We first set ρ = 0.5 for the elastic net, and then optimized the regularization parameter λ for all methods by retaining the best value over the entire regularization path (Friedman et al., 2010; Pedregosa et al., 2011). We normalized each feature between 0 and 1 by bounding the maximum keyword count to 20, and dividing each feature value by the same value.
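
The feature normalization in the Experiment Setup row and the repeated sampling in the Dataset Splits row can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the seed, and the choice to keep the training and test sets disjoint within each pair are assumptions made here for illustration; the quoted text does not say whether samples may be shared across the five pairs, nor how large the "remaining data" pool is.

```python
import random

KEYWORD_CAP = 20  # maximum keyword count stated in the Experiment Setup row


def normalize_counts(counts, cap=KEYWORD_CAP):
    """Clip each raw keyword count to `cap`, then divide by `cap`,
    so every feature value lies in [0, 1]."""
    return [min(c, cap) / cap for c in counts]


def sample_splits(n_samples, n_pairs=5, train_size=300, test_size=5000, seed=0):
    """Draw `n_pairs` random (train, test) index pairs.

    Assumption: train and test are disjoint within a pair; pairs are drawn
    independently, so indices may repeat across pairs.
    """
    if train_size + test_size > n_samples:
        raise ValueError("not enough samples for one train/test pair")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        idx = rng.sample(range(n_samples), train_size + test_size)
        pairs.append((idx[:train_size], idx[train_size:]))
    return pairs
```

For example, `normalize_counts([0, 5, 20, 37])` maps the raw counts to `[0.0, 0.25, 1.0, 1.0]`, and `sample_splits(10000)` returns five index pairs of sizes 300 and 5,000, where the pool size of 10,000 is a hypothetical value. Results would then be averaged over the five pairs, as the Dataset Splits row describes.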