MISSION: Ultra Large-Scale Feature Selection using Count-Sketches
Authors: Amirali Aghazadeh, Ryan Spring, Daniel LeJeune, Gautam Dasarathy, Anshumali Shrivastava, Richard G. Baraniuk
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We designed a set of simulations to evaluate MISSION in a controlled setting. All experiments were performed on a single machine with 2× Intel Xeon E5-2660 v4 processors (28 cores / 56 threads) and 512 GB of memory. The code for training and running our randomized-hashing approach is available online. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, Stanford University, Stanford, California 2Department of Computer Science, Rice University, Houston, Texas 3Department of Electrical and Computer Engineering, Rice University, Houston, Texas. |
| Pseudocode | Yes | Algorithm 1 MISSION (see the code sketch after this table) |
| Open Source Code | Yes | The code for training and running our randomized-hashing approach is available online: https://github.com/rdspring1/MISSION |
| Open Datasets | Yes | Datasets: We used four datasets in the experiments: 1) KDD2012, 2) RCV1, 3) Webspam Trigram, 4) DNA (http://projects.cbio.mines-paristech.fr/largescalemetagenomics/). The statistics of these datasets are summarized in Table 2. (Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge) |
| Dataset Splits | No | The paper provides 'Train Size' and 'Test Size' for the datasets but does not explicitly mention a 'validation' split or describe its configuration. |
| Hardware Specification | Yes | All experiments were performed on a single machine with 2× Intel Xeon E5-2660 v4 processors (28 cores / 56 threads) and 512 GB of memory. |
| Software Dependencies | No | The paper states 'The code for training and running our randomized-hashing approach is available online.' but does not specify software dependencies with version numbers (e.g., Python or PyTorch versions). |
| Experiment Setup | Yes | For all methods, we used the logistic loss for binary classification and the cross-entropy loss for multi-class classification. For all the experiments, the Count-Sketch data structure used 3 hash functions, and the model weights were divided equally among the hash arrays. All the methods were trained for a single epoch with a learning rate of 0.5. (These settings are illustrated in the code sketches below.) |
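
The 'Experiment Setup' row pins down the Count-Sketch configuration: 3 hash functions, with the memory budget split equally among the hash arrays. Below is a minimal Python sketch of such a Count-Sketch. The hashing scheme (seeded Python tuple hashing) and the names `CountSketch`, `budget`, and `n_hashes` are illustrative assumptions, not the released implementation linked in the 'Open Source Code' row.

```python
import numpy as np

class CountSketch:
    """Count-Sketch over model weights: n_hashes hash arrays, each given an
    equal share of the total memory budget (3 arrays in the paper's setup)."""

    def __init__(self, budget, n_hashes=3, seed=0):
        self.width = budget // n_hashes      # equal split across hash arrays
        self.tables = np.zeros((n_hashes, self.width))
        rng = np.random.default_rng(seed)
        # Per-array seeds; Python tuple hashing stands in for the real
        # hash functions (an assumption -- the released code differs).
        self.bucket_seeds = [int(s) for s in rng.integers(1, 2**31, n_hashes)]
        self.sign_seeds = [int(s) for s in rng.integers(1, 2**31, n_hashes)]

    def _bucket(self, h, i):
        return hash((self.bucket_seeds[h], i)) % self.width

    def _sign(self, h, i):
        return 1.0 if hash((self.sign_seeds[h], i)) & 1 else -1.0

    def update(self, i, delta):
        """Add `delta` to feature i's estimate in every hash array."""
        for h in range(len(self.tables)):
            self.tables[h, self._bucket(h, i)] += self._sign(h, i) * delta

    def query(self, i):
        """Count-Sketch estimator: median of the signed bucket values."""
        return float(np.median([self.tables[h, self._bucket(h, i)] * self._sign(h, i)
                                for h in range(len(self.tables))]))
```

With 3 hash arrays the median is just the middle of three signed bucket reads, which keeps a heavy feature colliding in one array from corrupting the estimate.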
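
The 'Pseudocode' row cites Algorithm 1 (MISSION). The loop below is a hedged reading of it under the setup row's hyperparameters (logistic loss, one epoch, learning rate 0.5): gradient updates are accumulated only in the Count-Sketch, and a top-k set of heavy hitters doubles as the active feature set for prediction. It reuses the `CountSketch` class from the previous sketch; `mission_epoch` and the sparse-sample format are assumptions for illustration, not the paper's interface.

```python
import heapq
import math

# Reuses the CountSketch class from the previous sketch.

def mission_epoch(samples, budget, k, lr=0.5):
    """One epoch of a MISSION-style update. `samples` yields
    (features, label) pairs with sparse features as {index: value}
    and label in {0, 1}; logistic loss per the experiment setup."""
    sketch = CountSketch(budget)
    topk = {}                                # feature index -> weight estimate

    for feats, y in samples:
        # Predict using only the weights retained in the top-k set.
        margin = sum(v * topk.get(i, 0.0) for i, v in feats.items())
        p = 1.0 / (1.0 + math.exp(-margin))
        err = p - y                          # d(logistic loss)/d(margin)

        for i, v in feats.items():
            sketch.update(i, -lr * err * v)  # SGD step lives in the sketch
            topk[i] = sketch.query(i)        # tentatively admit fresh estimate

        # Retain only the k largest-magnitude features (the heavy hitters).
        if len(topk) > k:
            topk = dict(heapq.nlargest(k, topk.items(),
                                       key=lambda kv: abs(kv[1])))
    return topk
```

On a toy stream where feature 0 alone carries the label, the selected set should surface it:

```python
import numpy as np
rng = np.random.default_rng(1)
data = [({0: 1.0 if y else -1.0, int(rng.integers(1, 10**6)): 1.0}, int(y))
        for y in rng.integers(0, 2, 2000)]
print(sorted(mission_epoch(data, budget=3000, k=5)))  # expect 0 among the keys
```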