Regularising Non-linear Models Using Feature Side-information

Authors: Amina Mollaysa, Pablo Strasser, Alexandros Kalousis

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments on a number of benchmark datasets which show significant predictive performance gains, over a number of baselines, as a result of the exploitation of the side-information.
Researcher Affiliation | Academia | University of Applied Sciences, Western Switzerland; University of Geneva.
Pseudocode | No | The paper presents its analytical and stochastic approximation methods in prose and equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not provide any information or link indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | We evaluated both approaches on the eight document classification datasets used in (Kusner et al., 2015). As feature side-information we use the word2vec representation of the words, which has a dimensionality of 300 (Mikolov et al., 2013).
Dataset Splits | Yes | We used early stopping where we keep 20% of the training data as the validation set. For those datasets without a predefined train/test split (BBCsport, Twitter, Classic, Amazon, Recipe), we use five-fold cross validation and report the average error.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments (e.g., GPU/CPU models, memory details).
Software Dependencies | No | The paper mentions using Adam (Kingma & Ba, 2014) and word2vec (Mikolov et al., 2013) but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We used α = 0.001, β1 = 0.9, β2 = 0.999 for one hidden layer networks, and α = 0.0001 for the networks with more hidden layers. We initialize all network parameters using (Glorot & Bengio, 2010). For the analytical model we set the maximum number of iterations to 5000. For the stochastic model we set the maximum number of iterations to 10000 for the one layer networks and to 20000 for networks with more layers. We used early stopping where we keep 20% of the training data as the validation set. We select the λ hyperparameters of AN, ST, and ℓ2 from {10^k | k = -3, ..., 3}; we select the λ of dropout from [0.1, 0.2, 0.3, 0.4, 0.5]. We set the c in the augmentation process, which controls the size of the neighborhood within which the output constraints should hold, to one. For the analytical model we set the mini-batch size m to five. For the stochastic model, as well as for all the baseline models, we set the mini-batch size to 20. In the experiments we set p = 5.
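
The "Dataset Splits" row quotes a protocol of holding out 20% of the training data as a validation set for early stopping and, for datasets without a predefined train/test split, running five-fold cross validation and averaging the error. The following is a minimal sketch of that protocol, assuming scikit-learn and NumPy are available; the train_and_score callback is a hypothetical stand-in for training one model with early stopping and returning its test error.

import numpy as np
from sklearn.model_selection import KFold, train_test_split

def evaluate_with_cv(X, y, train_and_score, n_folds=5, val_fraction=0.2, seed=0):
    # Five-fold cross validation; within each fold, 20% of the training part
    # is held out as a validation set used only for early stopping.
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in kf.split(X):
        X_tr, X_te = X[train_idx], X[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]
        X_fit, X_val, y_fit, y_val = train_test_split(
            X_tr, y_tr, test_size=val_fraction, random_state=seed)
        errors.append(train_and_score(X_fit, y_fit, X_val, y_val, X_te, y_te))
    # Report the average error over the folds, as the paper does.
    return float(np.mean(errors))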
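
The "Experiment Setup" row lists optimizer settings, initialization, hyperparameter grids, and batch sizes. The sketch below collects those reported values in PyTorch form; the network architecture and the input, hidden, and output dimensions are placeholders of my own and not taken from the paper, while the hyperparameter values themselves come from the row above.

import torch
import torch.nn as nn

def build_mlp(in_dim, hidden_dim, out_dim, n_hidden=1):
    # One-hidden-layer network with Glorot (Xavier) initialization
    # (Glorot & Bengio, 2010), as stated in the setup.
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden_dim), nn.ReLU()]
        d = hidden_dim
    layers.append(nn.Linear(d, out_dim))
    model = nn.Sequential(*layers)
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)
    return model

# Placeholder dimensions; the actual sizes depend on the dataset.
model = build_mlp(in_dim=1000, hidden_dim=100, out_dim=8)

# Adam as reported: alpha = 0.001, beta1 = 0.9, beta2 = 0.999 for one
# hidden layer (alpha = 0.0001 for networks with more hidden layers).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# Hyperparameter grids quoted in the row above.
lambda_grid = [10.0 ** k for k in range(-3, 4)]   # {10^k | k = -3, ..., 3}
dropout_grid = [0.1, 0.2, 0.3, 0.4, 0.5]

# Remaining reported settings.
batch_size_analytical = 5     # mini-batch size m for the analytical model
batch_size_stochastic = 20    # stochastic model and all baseline models
max_iters_analytical = 5000
max_iters_one_layer = 10000   # stochastic model, one hidden layer
max_iters_deeper = 20000      # stochastic model, more hidden layers
c_neighborhood = 1            # neighborhood size in the augmentation process
p = 5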