Feature Engineering for Predictive Modeling Using Reinforcement Learning
Authors: Udayan Khurana, Horst Samulowitz, Deepak Turaga
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We tested the impact of our FE on 48 publicly available datasets (different from the datasets used for training) from a variety of domains, and of various sizes. We report the accuracy of (a) the base dataset; (b) our FE routine with RL1, Bmax = 100; (c) Expansion-reduction: all transformations are first applied separately and added to the original columns, followed by a feature selection routine; (d) Random: a transformation function is applied to random feature(s) and the result added to the original dataset, measuring the CV performance; this is repeated 100 times, and finally all the new features whose cases showed an improvement in performance are kept, along with the original features, to train a model; (e) Tree-Heur: our implementation of Cognito's (Khurana et al. 2016b) global search heuristic for 100 nodes. We used Random Forest with default parameters as our learning algorithm for all the comparisons as it gave us the strongest baseline (no FE) average. A 5-fold cross validation using random stratified sampling was used. The results for a representative set of 24 of those datasets (due to lack of space) are captured in Table 1. (A minimal sketch of this evaluation protocol follows the table.) |
| Researcher Affiliation | Industry | Udayan Khurana ukhurana@us.ibm.com IBM Research AI Horst Samulowitz samulowitz@us.ibm.com IBM Research AI Deepak Turaga turaga@us.ibm.com IBM Research AI |
| Pseudocode | Yes | Algorithm 1 outlines the general methodology for exploration. At each step, an estimated reward from each possible move, R(G_i, n, t, i/Bmax), is used to rank the options of actions available at each given state of the transformation graph G_i, i ∈ [0, Bmax), where Bmax is the overall allocated budget in number of steps. (A sketch of this budget-limited exploration loop follows the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about open-source code availability or links to a code repository for the described methodology. |
| Open Datasets | Yes | We tested the impact of our FE on 48 publicly available datasets (different from the datasets used for training) from a variety of domains, and of various sizes. The results for a representative set of 24 of those datasets (due to lack of space) are captured in Table 1. |
| Dataset Splits | Yes | A 5-fold cross validation using random stratified sampling was used. |
| Hardware Specification | Yes | For reference to runtime, our FE took 4 minutes, 40 seconds to run for 100 nodes on the Bikeshare DC dataset, on a single thread on a 2.8GHz processor. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components used in the experiments. It only mentions using "Random Forest with default parameters" but no library name or version. |
| Experiment Setup | Yes | We used the discount factor γ = 0.99 and learning rate parameter α = 0.05. The weight vectors, w^c or w, each of size 12, were initialized with 1's. The training example steps are drawn randomly with probability ϵ = 0.15 and from the current policy with probability 1 − ϵ. (A minimal sketch of this configuration follows the table.) |
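
The evaluation protocol quoted above (a default-parameter Random Forest scored with 5-fold stratified cross-validation) maps naturally onto scikit-learn. The sketch below assumes scikit-learn, which the paper does not name; `load_dataset` is a hypothetical stand-in for fetching one of the benchmark datasets.

```python
# Minimal sketch of the reported evaluation protocol: a Random Forest with
# default parameters, scored by 5-fold stratified cross-validation.
# Assumption: scikit-learn (the paper names no library or version).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(X, y, seed=0):
    model = RandomForestClassifier(random_state=seed)  # default parameters
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv).mean()  # mean CV accuracy

# X_base, y = load_dataset("bikeshare_dc")            # hypothetical loader
# print("base accuracy:", evaluate(X_base, y))
# print("FE accuracy:", evaluate(X_engineered, y))    # after feature engineering
```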
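
Algorithm 1 itself is not reproduced in this report, but the quoted description pins down its shape: at each of Bmax steps, rank every feasible (node, transformation) action on the current transformation graph by the estimated reward R(G_i, n, t, i/Bmax) and apply the best one. The sketch below is a reconstruction under that reading; the graph methods (`nodes`, `is_applicable`, `apply`) and `estimate_reward` are hypothetical names, not the paper's API.

```python
# Sketch of the budget-limited exploration loop behind Algorithm 1:
# greedily apply the transformation with the highest estimated reward
# until the step budget b_max is exhausted.
def explore(graph, transformations, estimate_reward, b_max):
    for i in range(b_max):
        budget_ratio = i / b_max  # fraction of the budget already spent
        candidates = [(estimate_reward(graph, n, t, budget_ratio), n, t)
                      for n in graph.nodes()            # hypothetical API
                      for t in transformations
                      if graph.is_applicable(n, t)]
        if not candidates:
            break
        _, node, transform = max(candidates, key=lambda c: c[0])
        graph = graph.apply(node, transform)  # adds the derived node
    return graph
```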
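
The experiment-setup row fixes the learning hyperparameters (γ = 0.99, α = 0.05, ϵ = 0.15, a 12-dimensional weight vector initialized to ones), which is enough to sketch the implied Q-learning update with linear function approximation. `featurize`, producing the 12-dimensional feature vector for a state/action pair, is a hypothetical placeholder for the paper's characteristics.

```python
# Minimal sketch of Q-learning with linear function approximation under the
# reported hyperparameters. `featurize(state, action)` is a hypothetical
# stand-in returning the 12-dimensional feature vector.
import random
import numpy as np

GAMMA, ALPHA, EPSILON, DIM = 0.99, 0.05, 0.15, 12
w = np.ones(DIM)  # weight vector initialized with 1's

def q_value(state, action):
    return w @ featurize(state, action)  # Q(s, a) = w . f(s, a)

def choose_action(state, actions):
    # epsilon-greedy: random step with probability epsilon,
    # current policy with probability 1 - epsilon
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_value(state, a))

def update(state, action, reward, next_state, next_actions):
    # one temporal-difference step; assumes next_actions is non-empty
    target = reward + GAMMA * max(q_value(next_state, a) for a in next_actions)
    td_error = target - q_value(state, action)
    w[:] = w + ALPHA * td_error * featurize(state, action)
```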