Feature selection in functional data classification with recursive maxima hunting

Authors: José L. Torrecilla, Alberto Suárez

NeurIPS 2016

Reproducibility assessment: for each variable below, the assessed result is followed by the supporting LLM response (quoted from the paper).
Research Type: Experimental
LLM Response: "The results of an extensive empirical evaluation are used to illustrate that, in the problems investigated, RMH has comparable or higher predictive accuracy than standard dimensionality reduction techniques, such as PCA and PLS, and state-of-the-art feature selection methods for functional data, such as maxima hunting."
Researcher Affiliation: Academia
LLM Response: "José L. Torrecilla, Computer Science Department, Universidad Autónoma de Madrid, 28049 Madrid, Spain, joseluis.torrecilla@uam.es; Alberto Suárez, Computer Science Department, Universidad Autónoma de Madrid, 28049 Madrid, Spain, alberto.suarez@uam.es"
Pseudocode: Yes
LLM Response (Algorithm 1: Recursive Maxima Hunting):

function RMH(X(t), Y)
    t <- []                                # vector of selected points, initially empty
    RMH_rec(X(t), Y, 0, 1)                 # recursive search of the maxima of R^2(X(t), Y)
    return t                               # vector of selected points
end function

procedure RMH_rec(X(t), Y, t_inf, t_sup)
    t_max <- argmax_{t_inf <= t <= t_sup} R^2(X(t), Y)
    if R^2(X(t_max), Y) > s then
        t <- [t, t_max]                    # include t_max in t, the vector of selected points
        X(t) <- X(t) - E(X(t) | X(t_max)), t in [t_inf, t_sup]   # correction of type (5) or (6), as required
    else
        return
    end if
    # Exclude redundant points to the left of t_max
    t_max^- <- max { t_inf <= t < t_max : R^2(X(t_max), X(t)) <= r }
    if t_max^- > t_inf then
        RMH_rec(X(t), Y, t_inf, t_max^-)   # recursion on left subinterval
    end if
    # Exclude redundant points to the right of t_max
    t_max^+ <- min { t_max < t <= t_sup : R^2(X(t_max), X(t)) <= r }
    if t_max^+ < t_sup then
        RMH_rec(X(t), Y, t_max^+, t_sup)   # recursion on right subinterval
    end if
end procedure
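The algorithm above lends itself to a compact implementation. Below is a minimal, hypothetical Python sketch (not the authors' code): the dependence measure R^2 (the distance correlation in the paper) is approximated by the squared Pearson correlation, and the conditional-expectation correction E(X(t) | X(t_max)) is approximated by an empirical linear regression rather than the paper's Brownian-motion corrections (5)-(6). The names rmh and _r2 are invented for this sketch.

import numpy as np

def _r2(u, v):
    # Squared Pearson correlation as a stand-in for the paper's R^2.
    if np.std(u) == 0.0 or np.std(v) == 0.0:
        return 0.0
    return float(np.corrcoef(u, v)[0, 1] ** 2)

def rmh(X, y, s=0.05, r=0.8):
    # X: array of shape (n_samples, n_points) with the discretized curves.
    # y: array of shape (n_samples,) with binary class labels.
    X = np.asarray(X, dtype=float).copy()
    y = np.asarray(y, dtype=float)
    selected = []

    def rmh_rec(lo, hi):
        if lo > hi:
            return
        scores = [_r2(X[:, t], y) for t in range(lo, hi + 1)]
        k = int(np.argmax(scores))
        t_max = lo + k
        if scores[k] <= s:                      # relevance threshold not met
            return
        selected.append(t_max)
        x_max = X[:, t_max].copy()              # values before the correction
        # Redundancy of every point with t_max, measured before the correction.
        red = [_r2(x_max, X[:, t]) for t in range(lo, hi + 1)]
        # Subtract a linear-regression proxy of E(X(t) | X(t_max)).
        var = float(np.var(x_max))
        if var > 0.0:
            for t in range(lo, hi + 1):
                beta = np.cov(X[:, t], x_max, bias=True)[0, 1] / var
                X[:, t] -= beta * x_max
        # Recurse on the left subinterval, skipping points redundant with t_max.
        left = [t for t in range(lo, t_max) if red[t - lo] <= r]
        if left:
            rmh_rec(lo, max(left))
        # Recurse on the right subinterval, likewise.
        right = [t for t in range(t_max + 1, hi + 1) if red[t - lo] <= r]
        if right:
            rmh_rec(min(right), hi)

    rmh_rec(0, X.shape[1] - 1)
    return sorted(selected)

On a discretized sample, rmh(X, y) returns the grid indices of the selected points, which would then feed a standard multivariate classifier such as kNN.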
Open Source Code: No
LLM Response: "No explicit statement or link providing concrete access to the source code for the described methodology was found."
Open Datasets: Yes
LLM Response: "To assess the performance of RMH in real-world functional classification problems, we have carried out a second batch of experiments on four datasets that are commonly used as benchmarks in the FDA literature. Instances in Growth correspond to curves of the heights of 54 girls and 38 boys from the Berkeley Growth Study. Observations are discretized at 31 non-equidistant ages between 1 and 18 years (Ramsay and Silverman, 2005; Mosler and Mozharovskyi, 2014). The Tecator dataset consists of 215 near-infrared absorbance spectra of finely chopped meat. The spectral curves consist of 100 equally spaced points. The class labels are determined in terms of fat content (above or below 20%). The curves are fairly smooth; in consequence, we have followed the general recommendation and used the second derivative for classification (Ferraty and Vieu, 2006; Galeano et al., 2014). The Phoneme dataset consists of 4509 log-periodograms observed at 256 equidistant points. Here, we consider the binary problem of distinguishing between the phonemes aa (695 instances) and ao (1022 instances) (Galeano et al., 2014). Following Delaigle and Hall (2012a), the curves are smoothed with a local linear method and truncated to the first 50 variables. The Medflies dataset consists of records of the daily egg-laying patterns of a thousand flies. The goal is to discriminate between short- and long-lived flies. Following Mosler and Mozharovskyi (2014), curves equal to zero are excluded. There are 512 thirty-day curves (starting from day 5) of flies that live at least 34 days; 266 of these are long-lived (reach day 44). The classes in Growth and Tecator are well separated; in consequence, they are relatively easy problems. By contrast, Phoneme and Medflies are notoriously difficult classification tasks."
Dataset Splits: Yes
LLM Response: "The value of k in kNN is selected by 10-fold CV from the integers in [1, N_train], where N_train is the size of the training set. The value of s is selected from the set {0.025, 0.05, 0.1} by 10-fold CV. In the other methods, the number of selected variables is determined using 10-fold CV, with a maximum of 30. In these experiments [on synthetic data], training samples of different sizes (N_train = 50, 100, 200, 500, 1000) and an independent test set of size 1000 are generated. To estimate the classification error [on the real-world datasets], the datasets are partitioned at random into a training set (with 2/3 of the observations) and a test set (1/3)."
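As a concrete illustration of this split-and-selection protocol, here is a hypothetical sketch. The paper does not name its software, so scikit-learn is an assumption, and the feature matrix below is a synthetic placeholder:

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))          # placeholder for the selected features
y = rng.integers(0, 2, size=300)        # placeholder binary labels

# Real-world datasets: random partition, 2/3 training and 1/3 test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

# k in kNN chosen by 10-fold CV; the paper searches the integers in
# [1, N_train], truncated here to keep the sketch fast.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},
    cv=10,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))  # selected k, test accuracy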
Hardware Specification: No
LLM Response: "No specific hardware details (such as CPU/GPU models or machine configurations) used for running the experiments are mentioned in the paper."
Software Dependencies: No
LLM Response: "The paper mentions methods such as k-nearest neighbors (kNN) and the linear Fisher discriminant, but it names no software packages and provides no version numbers, which would be needed for a reproducible description."
Experiment Setup: Yes
LLM Response: "RMH requires determining the values of two hyperparameters: the redundancy threshold r (0 < r < 1, typically close to 1) and the relevance threshold s (0 < s < 1, typically close to 0). Through extensive simulations we have observed that RMH is quite robust for a wide range of appropriate values of these parameters. In particular, the results are very similar for values of r in the interval [0.75, 0.95]. The predictive accuracy is somewhat more sensitive to the choice of s: if s is too small, irrelevant variables can be selected; if s is too large, relevant points may be excluded. For most of the experiments performed, the optimal values of s are between 0.025 and 0.1. In view of these observations, the experiments are performed using r = 0.8. The value of s is selected from the set {0.025, 0.05, 0.1} by 10-fold CV. A more careful determination of r and s can be beneficial, especially in some extreme problems (e.g., with very smooth or with rapidly varying trajectories). In RMH, the number of selected variables is not fixed beforehand; it depends indirectly on the values of r and s. In the other methods, the number of selected variables is determined using 10-fold CV, with a maximum of 30."
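A hypothetical sketch of this hyperparameter protocol, reusing the rmh() function sketched above (again an assumption, not the authors' code): r is fixed at 0.8 and s is chosen from {0.025, 0.05, 0.1} by 10-fold CV on the training data.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def select_s_by_cv(X_tr, y_tr, candidates=(0.025, 0.05, 0.1), r=0.8):
    # Pick the relevance threshold s whose selected features give the best
    # 10-fold CV accuracy with a kNN classifier on the training data.
    best_s, best_acc = None, -np.inf
    for s in candidates:
        idx = rmh(X_tr, y_tr, s=s, r=r)     # rmh() from the sketch above
        if not idx:
            continue                        # no point passed the threshold
        acc = cross_val_score(
            KNeighborsClassifier(), X_tr[:, idx], y_tr, cv=10
        ).mean()
        if acc > best_acc:
            best_s, best_acc = s, acc
    return best_s

For simplicity this sketch runs the feature selection once before cross-validating; a stricter protocol would rerun RMH inside each fold to avoid selection bias.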