Searching for Machine Learning Pipelines Using a Context-Free Grammar
Authors: Radu Marinescu, Akihiro Kishimoto, Parikshit Ram, Ambrish Rawat, Martin Wistuba, Paulito P. Palmes, Adi Botea
Pages: 8902-8911
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments on various benchmark datasets show that our approach is highly competitive and often outperforms existing AutoML systems." "Our results show conclusively that PIPER is highly competitive and often outperforms existing state-of-the-art AutoML systems." |
| Researcher Affiliation | Industry | Radu Marinescu (1), Akihiro Kishimoto (1), Parikshit Ram (1), Ambrish Rawat (1), Martin Wistuba (1), Paulito P. Palmes (1), Adi Botea (2); (1) IBM Research, (2) Eaton; radu.marinescu@ie.ibm.com, Akihiro.Kishimoto@ibm.com, parikshit.ram@ibm.com, Ambrish.Rawat@ie.ibm.com, Martin.Wistuba@ibm.com, paulpalmes@ie.ibm.com, adi.botea@eaton.com |
| Pseudocode | Yes | Algorithm 1 PIPER: greedy best-first search for pipeline structure generation and optimization |
| Open Source Code | No | The paper notes the availability of third-party tools (Hyperopt) and baselines (TPOT, MOSAIC) but provides no link to, or explicit statement of, an open-source release of the PIPER system itself. |
| Open Datasets | Yes | "We considered 15 datasets from the OpenML repository (Vanschoren et al. 2013)" and "We consider a collection of 504 binary and multi-class classification datasets from the OpenML repository (Vanschoren et al. 2013)" |
| Dataset Splits | No | The paper specifies a '70/30 split ratio for training and test data' but does not explicitly mention a separate validation dataset split. |
| Hardware Specification | Yes | All algorithms were implemented in Python 3.6 using the scikit-learn algorithms (Pedregosa, Varoquaux, and Gramfort 2011) and we ran all experiments on a 2.6GHz CPU with 20GB of RAM. |
| Software Dependencies | No | The paper states it was "implemented in Python 3.6 using the scikit-learn algorithms" but does not give version numbers for scikit-learn or any other key library; only the Python version is specified. |
| Experiment Setup | Yes | The total computational budget is set to 4 hours for each dataset, and performance is averaged over 10 independent runs. ADMM was configured with combinatorial multi-arm bandits for solving the discrete optimization sub-problem and Bayesian optimization for the continuous one, using 25 iterations per sub-problem as suggested in (Liu et al. 2020). For TPOT the population size was set to 100. PIPER and PIPERX allocate the first 20 minutes (1200 seconds) to the greedy best-first search for finding the most promising DAG-shaped pipeline structure, while the remaining time is used for optimizing the pipeline structure found. PIPERZ allocates at most 10 minutes (600 seconds) to each terminal pipeline optimization and continues the search until the entire time budget is exhausted. |
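The paper's Algorithm 1 is a greedy best-first search over pipeline structures derived from a context-free grammar. The sketch below illustrates that general pattern only: the grammar, the `score` heuristic, and all component names are toy placeholders invented for illustration, not the paper's actual PIPER grammar or scoring function.

```python
import heapq
import itertools

# Toy context-free grammar over pipeline structures. Uppercase symbols are
# nonterminals; lowercase strings are terminal pipeline components.
# (Illustrative only; PIPER's grammar is richer and yields DAG-shaped pipelines.)
GRAMMAR = {
    "PIPELINE": [["PRE", "EST"], ["EST"]],
    "PRE": [["scaler"], ["pca"]],
    "EST": [["logreg"], ["tree"]],
}

def is_terminal(sym):
    return sym not in GRAMMAR

def score(symbols):
    # Toy heuristic (lower is better). In PIPER this would estimate the
    # quality of the (partial) pipeline; here it is just derivation length,
    # so the search greedily prefers shorter pipelines.
    return len(symbols)

def best_first_search(start="PIPELINE", max_expansions=100):
    """Greedy best-first search over grammar derivations (a sketch in the
    spirit of Algorithm 1, not the authors' implementation)."""
    counter = itertools.count()  # tie-breaker so the heap never compares lists
    frontier = [(score([start]), next(counter), [start])]
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, symbols = heapq.heappop(frontier)
        if all(is_terminal(s) for s in symbols):
            return symbols  # first fully terminal pipeline reached
        expansions += 1
        # Expand the leftmost nonterminal with every production in the grammar.
        i = next(j for j, s in enumerate(symbols) if not is_terminal(s))
        for rhs in GRAMMAR[symbols[i]]:
            child = symbols[:i] + rhs + symbols[i + 1:]
            heapq.heappush(frontier, (score(child), next(counter), child))
    return None

print(best_first_search())  # → ['logreg']
```

With this length-based toy heuristic the search immediately commits to the shortest derivation, a single-component pipeline; in the reported setup this search phase would run for the first 20 minutes before hyperparameter optimization of the structure found takes over the remaining budget.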