Regularising Non-linear Models Using Feature Side-information
Authors: Amina Mollaysa, Pablo Strasser, Alexandros Kalousis
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments on a number of benchmark datasets which show significant predictive performance gains, over a number of baselines, as a result of the exploitation of the side-information. |
| Researcher Affiliation | Academia | University of Applied Sciences, Western Switzerland; University of Geneva. |
| Pseudocode | No | The paper describes its analytical and stochastic approaches mathematically, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide any information or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | We evaluated both approaches on the eight document classification datasets used in (Kusner et al., 2015). As feature side-information we use the word2vec representation of the words which have a dimensionality of 300 (Mikolov et al., 2013). |
| Dataset Splits | Yes | We used early stopping where we keep 20% of the training data as the validation set. For those datasets without a predefined train/test split (BBCsport, Twitter, Classic, Amazon, Recipe), we use five-fold cross validation and report the average error. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU/CPU models, memory details). |
| Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2014) and word2vec (Mikolov et al., 2013) but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We used α = 0.001, β1 = 0.9, β2 = 0.999 for one hidden layer networks, and α = 0.0001 for the networks with more hidden layers. We initialize all networks parameters using (Glorot & Bengio, 2010). For the analytical model we set the maximum number of iterations to 5000. For the stochastic model we set the maximum number of iterations to 10000 for the one layer networks and to 20000 for networks with more layers. We used early stopping where we keep 20% of the training data as the validation set. We select the λ hyperparameters of AN, ST, and ℓ2 from {10^k | k = −3, . . . , 3}; we select the λ of dropout from [0.1, 0.2, 0.3, 0.4, 0.5]. We set the c in the augmentation process, which controls the size of the neighborhood within which the output constraints should hold, to one. For the analytical model we set the mini-batch size m to five. For the stochastic model, as well as for all the baseline models, we set the mini-batch size to 20. In the experiments we set p = 5. |
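
The Dataset Splits row above describes the evaluation protocol: five-fold cross validation for the datasets without a predefined train/test split, with 20% of each training portion held out as a validation set for early stopping. Below is a minimal sketch of that protocol, assuming scikit-learn and NumPy; the `train_and_evaluate` callback and variable names are hypothetical placeholders, not part of the paper.

```python
# Sketch of the split protocol (assumption: scikit-learn utilities; the
# train_and_evaluate callback is a hypothetical placeholder).
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def five_fold_error(X, y, train_and_evaluate, seed=0):
    """Five-fold CV; within each fold, 20% of the training data is held out
    as a validation set for early stopping, as stated in the paper."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in kf.split(X):
        X_tr, X_te = X[train_idx], X[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]
        # 20% of the training portion becomes the early-stopping validation set
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_tr, y_tr, test_size=0.2, random_state=seed)
        errors.append(train_and_evaluate(X_fit, y_fit, X_val, y_val, X_te, y_te))
    return float(np.mean(errors))  # average error over the five folds
```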
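
The Experiment Setup row collects the reported optimisation settings. The sketch below shows how they might be instantiated, assuming PyTorch; the layer widths, activation, and output dimension are hypothetical placeholders, while the Adam settings, Glorot initialisation, mini-batch size, iteration budgets, and hyperparameter grids follow the quoted text.

```python
# Sketch of the reported training configuration (assumption: PyTorch;
# in_dim, hidden, out_dim and the ReLU activation are illustrative only).
import torch
import torch.nn as nn

in_dim, hidden, out_dim = 1000, 100, 2            # hypothetical dimensions
lambda_grid = [10.0 ** k for k in range(-3, 4)]   # {10^k | k = -3, ..., 3}
dropout_grid = [0.1, 0.2, 0.3, 0.4, 0.5]          # dropout rates searched
batch_size = 20        # stochastic model and baselines (5 for the analytical model)
max_iters = 10_000     # one-hidden-layer stochastic model (20,000 for deeper ones)

model = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                      nn.Linear(hidden, out_dim))
for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)     # (Glorot & Bengio, 2010)
        nn.init.zeros_(layer.bias)

# Adam with alpha = 0.001, beta1 = 0.9, beta2 = 0.999 for one-hidden-layer
# networks; the paper reports alpha = 0.0001 for deeper networks.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```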