Is Feature Selection Secure against Training Data Poisoning?
Authors: Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, Fabio Roli
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results on malware detection show that feature selection methods can be significantly compromised under attack... We report experiments on PDF malware detection... In our experiments, we exploit the feature representation proposed by Maiorca et al. (2012)... Experimental results. Results are reported in Fig. 2... |
| Researcher Affiliation | Academia | Department of Computer Science, Technische Universität München, Boltzmannstr. 3, 85748 Garching, Germany; Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, 09123 Cagliari, Italy; School of Computer Science, University of Manchester, Oxford Road, M13 9PL, UK |
| Pseudocode | Yes | Algorithm 1 Poisoning Embedded Feature Selection |
| Open Source Code | No | No explicit statement or link providing concrete access to source code for the methodology described in this paper was found. |
| Open Datasets | No | We collected 5993 recent malware samples from the Contagio dataset,⁴ and 5951 benign samples from the web. (Footnote 4: http://contagiodump.blogspot.it). No formal citation with authors/year or a recognized dataset repository link was provided for public access to the specific dataset. |
| Dataset Splits | Yes | We then randomly sampled five pairs of training and test sets from the remaining data, respectively consisting of 300 and 5,000 samples, to average the final results. To simulate LK attacks (Sect. 2.2), we also sampled an additional set of five training sets (to serve as D̂) consisting of 300 samples each. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instance types) used for running the experiments were provided. |
| Software Dependencies | No | No specific version numbers for software dependencies were provided. The paper references 'Scikit-learn: Machine learning in Python' (Pedregosa et al., 2011) but does not state its version or other software versions used in the experiments. |
| Experiment Setup | Yes | We first set ρ = 0.5 for the elastic net, and then optimized the regularization parameter λ for all methods by retaining the best value over the entire regularization path (Friedman et al., 2010; Pedregosa et al., 2011). We normalized each feature between 0 and 1 by bounding the maximum keyword count to 20, and dividing each feature value by the same value. |
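
The normalization and split-sampling steps quoted in the table can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the random seed, and the pool size of 10,000 are hypothetical, and the paper does not say whether samples may recur across the five pairs (this sketch allows it, while keeping each training set disjoint from its own test set).

```python
import random

MAX_KEYWORD_COUNT = 20  # cap on keyword counts stated in the experiment setup


def normalize_counts(counts, cap=MAX_KEYWORD_COUNT):
    """Clip each keyword count at `cap`, then divide by `cap` so values lie in [0, 1]."""
    return [min(c, cap) / cap for c in counts]


def sample_split_pairs(n_samples, n_pairs=5, n_train=300, n_test=5000, seed=0):
    """Draw `n_pairs` (train, test) index pairs from range(n_samples).

    Within a pair, train and test indices are disjoint. Overlap across
    pairs is allowed (an assumption; the paper does not specify).
    """
    rng = random.Random(seed)  # hypothetical seed, for repeatability only
    pairs = []
    for _ in range(n_pairs):
        idx = rng.sample(range(n_samples), n_train + n_test)
        pairs.append((idx[:n_train], idx[n_train:]))
    return pairs


if __name__ == "__main__":
    print(normalize_counts([0, 5, 20, 37]))  # [0.0, 0.25, 1.0, 1.0]
    # 10000 is a placeholder pool size, not a figure from the paper.
    for train_idx, test_idx in sample_split_pairs(n_samples=10000):
        print(len(train_idx), len(test_idx))
```

The five surrogate training sets used for the LK attacks could be drawn the same way by calling `rng.sample` for 300 indices each; that step is omitted here because the paper does not state how they relate to the main pool.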