Efficient Transformed Gaussian Processes for Non-Stationary Dependent Multi-class Classification
Authors: Juan Maroñas, Daniel Hernández-Lobato
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that ETGP, in general, outperforms state-of-the-art methods for multi-class classification based on GPs, and has a lower computational cost (around one order of magnitude smaller). |
| Researcher Affiliation | Collaboration | 1Machine Learning Group, Universidad Autónoma de Madrid, Madrid, Spain. 2Work done previous to joining Cognizant, Madrid, Spain. |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | No | The code will be released in this repository: https://github.com/jmaronas/Efficient_Multiclass_Gaussian_Processes_using_TGP. This repository is being used for other projects, so at the moment it stays closed. Drop us an email and we will happily share the code. |
| Open Datasets | Yes | We evaluate ETGP in 5 UCI datasets (Lichman, 2013) (see Fig. 3 for details). |
| Dataset Splits | Yes | For the ETGP, model selection was done using a validation split with a different number of points per dataset. This information is provided in the code that loads the data. |
| Hardware Specification | No | The paper mentions 'computer cluster' but does not provide specific hardware details (e.g., GPU/CPU models, memory, or detailed specifications). |
| Software Dependencies | Yes | Experiments are run using GPFLOW (Matthews et al., 2017; van der Wilk et al., 2020). We used GPFLOW version 2.1.3 and have found that the current issue also appears in the latest stable version, 2.5.2. |
| Experiment Setup | Yes | Common to all experiments is the following information. Experiments are run using GPFLOW (Matthews et al., 2017; van der Wilk et al., 2020). Unless mentioned, we use default GPFLOW parameters. Inducing points are initialized using the K-means algorithm for vowel, absenteeism and avila with 10 reinitializations, and parallel K-means for characterfont and devangari with 3 reinitializations. The length scale of RBF kernels was initialized to 2.0 and the mixing matrix randomly. Non-stationary kernels are initialized with a length scale of 2.0 for the arc-cosine kernel and with an identity matrix for the Neural Network kernel. All kernels employ automatic relevance determination where possible. The variational mean is initialized to zero and the Cholesky factorization of the variational covariance to the identity matrix multiplied by 1e-5. In all the experiments, the model used to compute the train/valid/test metrics was the model corresponding to the epoch with the best (highest) ELBO. We use the Adam optimizer with a batch size of 10000, which means that on the small datasets we are performing full-batch gradient descent. This, together with our deterministic initialization procedure, removes most of the randomness in the results and removes the need to run the same experiments several times with different seeds. Note that in the ETGP, although the parameters of the NN are initialized randomly, we run an initialization procedure several times, removing the influence of the random initialization. On the big datasets we perform stochastic gradient descent; however, running the models several times to remove possible noise from the stochastic gradient algorithm is infeasible, especially for SVGPs, where some of the models took 4 days to run on our computer cluster. For all the SVGP models we run models with learning rate values of 0.01 and 0.001. For certain choices of hyper-parameters, if we saw that 0.01 was providing better results than 0.001, we kept searching only with 0.01. In some cases we also tried other learning rates, e.g. 0.05, in order to find the best baseline model to compare against. We run either 10000 or 15000 epochs for vowel, absenteeism and avila, and 100, 200, 500, 1000, 2000 epochs for characterfont and devangari. For these last two datasets we did not always launch 2000 epochs, and only did so if we found a big increase in performance from the 500-epoch to the 1000-epoch run. Note that training times are averages over epochs and we do not provide the full time of the experiment (which in turn implies that the ETGP is even faster, since we run it for just 500 epochs). We run models with {100, 50, 20} inducing points for vowel, absenteeism and avila, and 100 for characterfont and devangari. We also experiment with the parameters of the covariance (including the mixing-matrix parameters in RBFCORR) being frozen for 2000 (vowel, absenteeism and avila) or 50 (characterfont and devangari) epochs, or trained end to end, i.e. with no freezing, following Maroñas et al. (2021) and Hensman et al. (2015). Once all these experiments were launched, we select, for each combination of kernel, number of inducing points, etc., the model giving the best performance by looking directly at the test set, in order to evaluate the proposed model in the most optimistic situation for each SVGP baseline. For the ETGP, model selection was done using a validation split with a different number of points per dataset. This information is provided in the code that loads the data.
For the ETGP, all the models are run for 15000 epochs for vowel, absenteeism and avila and 500 epochs for characterfont and devangari (which implies that the total training time of our models is even faster), and the model selected as best on validation with 100 inducing points is then run with 50 and 20, in contrast with the SVGP, where each 50- and 20-inducing-point model can have its own set of training hyperparameters. Bayesian flows are trained with 1 Monte Carlo dropout sample and evaluated (i.e. posterior predictive computation) using 20 dropout samples. The learning rates tried were 0.01 and 0.001, and all the parameters are trained from the beginning, without freezing. The NN architectures were chosen depending on the input size of the dataset. All these architectures have an input layer equal to the dimensionality of the data and an output layer given by the number of parameters of the flow multiplied by the number of classes. We tested LINEAR, SAL (Rios & Tobar, 2019) with length 3, and TANH (Snelson et al., 2003) with length 3 and 4 elements in the linear combination. The length of the flow corresponds to the value of K in the flow parameterization, i.e. it is the number of, e.g., individual SAL transformations being concatenated. All the NNs use the hyperbolic tangent activation function, and the variance of the Gaussian prior over flow parameters is set to 5000, 50000, 50000, which corresponds to a weight decay factor of 1e-4, 1e-5, 1e-6, without considering the constant value of the Gaussian prior that depends on the number of parameters. For vowel, absenteeism and avila we test networks with 0, 1, 2 hidden layers with 25, 50, 100 neurons per layer and with dropout probabilities of 0.25, 0.5, 0.75, except avila, which only uses 0.25, 0.5. We tested 0.75 to see if higher uncertainty in the NN posterior could help in regularizing the datasets with a smaller number of training points. For devangari we test 0, 1, 2 hidden layers with 512, 1024 neurons per layer. We also tested a projection network of 0, 1 hidden layers with 512 neurons per hidden layer and 256 neurons in the output layer. The output of this projection network is fed into another neural network that maps the 256 dimensions to the number of parameters. This second NN has 0, 1 hidden layers with 256, 128 neurons per layer. All these networks have a dropout probability of 0.5. For characterfont we also use a dropout probability of 0.5 and NNs with 0, 1, 2 hidden layers with 256 neurons per layer. We also test projection networks of 0, 1, 2 hidden layers with 512, 256 neurons per hidden layer and an output layer of 256 neurons. This is then fed into another neural network with 0, 1, 2 hidden layers and 256 neurons per layer. Regarding the initialization of the flows, we follow Maroñas et al. (2021) and initialize the flows to the identity by first learning the identity mapping using a non-input-dependent flow, and then learning the parameters of the neural network to match each point in the training dataset to the learned non-input-dependent parameters. Both initialization procedures are launched 5 times with a learning rate of 0.05 and the Adam optimizer, for any dataset and flow architecture. The input-dependent initialization is run for 1000 epochs on vowel, absenteeism and avila and for 100 epochs on characterfont and devangari. Some preliminary runs were done to test whether these hyperparameters allow the flow to be properly initialized, and these settings were then used for every flow initialization in our validation search without further analysis.
We found that, in general, the flow could also be initialized properly with fewer epochs, but we decided to run a considerable number of initialization epochs. We highlight that this procedure can be done in parallel with the K-means initialization, for readers concerned with the training time associated with it. (Illustrative code sketches of parts of this setup follow the table.) |
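
For readers who want a concrete picture of the SVGP baseline configuration quoted in the Experiment Setup row, the following is a minimal GPflow 2.x sketch, not the authors' code: the data arrays, class count, likelihood choice and epoch count are placeholders, while the K-means inducing-point initialization, the ARD RBF lengthscale of 2.0, the zero variational mean with a 1e-5 Cholesky factor, Adam with batch size 10000 and best-ELBO model selection follow the quote.

```python
import numpy as np
import tensorflow as tf
import gpflow
from sklearn.cluster import KMeans

# Placeholder data: N x D inputs and integer class labels in a column vector.
# The array sizes and class count are illustrative, not a specific UCI dataset.
N, D, num_classes, M = 528, 10, 11, 100
X = np.random.randn(N, D)
y = np.random.randint(0, num_classes, size=(N, 1)).astype(np.float64)

# Inducing points initialized with K-means (the quote uses 10 re-initializations
# on the small datasets; sklearn's n_init plays that role here).
Z = KMeans(n_clusters=M, n_init=10).fit(X).cluster_centers_

# RBF kernel with ARD, lengthscales initialized to 2.0.
kernel = gpflow.kernels.SquaredExponential(lengthscales=2.0 * np.ones(D))

model = gpflow.models.SVGP(
    kernel=kernel,
    likelihood=gpflow.likelihoods.MultiClass(num_classes),  # illustrative choice
    inducing_variable=Z,
    num_latent_gps=num_classes,
    num_data=N,
    # Variational mean initialized to zero; Cholesky factor to 1e-5 * identity.
    q_mu=np.zeros((M, num_classes)),
    q_sqrt=1e-5 * np.tile(np.eye(M), (num_classes, 1, 1)),
)

# Adam with batch size 10000: on the small datasets this is full-batch training.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
dataset = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(N).batch(10000)

best_elbo, best_epoch = -np.inf, -1
for epoch in range(100):                         # the quote uses 10000-15000 epochs
    for batch in dataset:
        with tf.GradientTape() as tape:
            loss = model.training_loss(batch)    # negative ELBO
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    elbo = float(model.elbo((X, y)))
    if elbo > best_elbo:                         # keep the epoch with the highest ELBO
        best_elbo, best_epoch = elbo, epoch      # (a real run would checkpoint here)
```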
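
The quoted mapping from the variance of the Gaussian prior over the flow parameters to a weight-decay factor follows the standard identification of a zero-mean Gaussian prior with an L2 penalty, ignoring the constant term as the quote notes: weight decay = 1/(2·sigma^2). A minimal check, where the 1e-6 row uses the variance this identification implies:

```python
# Equivalent weight decay of a zero-mean Gaussian prior N(0, sigma^2 I):
#   -log p(w) = ||w||^2 / (2 * sigma^2) + const   =>   weight_decay = 1 / (2 * sigma^2)
for variance in (5_000.0, 50_000.0, 500_000.0):
    print(f"prior variance {variance:>8.0f}  ->  weight decay {1.0 / (2.0 * variance):.0e}")
# prior variance     5000  ->  weight decay 1e-04
# prior variance    50000  ->  weight decay 1e-05
# prior variance   500000  ->  weight decay 1e-06
```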
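
One of the projection-network configurations described for devangari and characterfont can be written as a short Keras sketch. This is an illustration under stated assumptions, not the authors' architecture: the 512/256 projection, the 256-unit head and the 0.5 dropout follow the quote, while the class count, the parameters-per-flow count and the dropout placement are placeholders.

```python
import tensorflow as tf

# Projection network (512 hidden units, 256-dimensional output) whose output is
# fed into a head network (256 hidden units) producing the flow parameters.
flow_params_per_class = 12            # e.g. a TANH flow with 4 tanh terms (illustrative)
num_classes = 46                      # placeholder class count
num_flow_params = flow_params_per_class * num_classes

projection_net = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="tanh"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, activation="tanh"),
    tf.keras.layers.Dropout(0.5),
])
head_net = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="tanh"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_flow_params),      # one set of flow parameters per class
])
param_net = tf.keras.Sequential([projection_net, head_net])
```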
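
Finally, the TANH flow parameterization and the two-stage identity initialization described at the end of the quote can be sketched as follows. This is a hypothetical reconstruction: the flow form, the tanh activations with dropout, and Adam with learning rate 0.05 for both initialization stages come from the quote; the softplus positivity constraint, layer sizes, latent grid and all variable names are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

# Input-dependent TANH warping flow (Snelson et al., 2003) of the form
#   T(f) = f + sum_k a_k(x) * tanh(b_k(x) * (f + c_k(x))),   a_k, b_k > 0,
# with the parameters produced per class by a small NN with tanh activations and
# dropout. Flow length (concatenating several such blocks) is omitted for brevity.
K, num_classes, D = 4, 11, 10         # tanh terms, classes, input dim (placeholders)

param_net = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="tanh"),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(3 * K * num_classes),
])

def apply_tanh_flow(f, theta):
    """f: (N, C) latent values; theta: (N, C, K, 3) raw flow parameters."""
    a = tf.nn.softplus(theta[..., 0])             # positivity keeps the warping monotone
    b = tf.nn.softplus(theta[..., 1])
    c = theta[..., 2]
    return f + tf.reduce_sum(a * tf.tanh(b * (f[..., None] + c)), axis=-1)

def input_dependent_flow(f, x, training=False):
    theta = tf.reshape(param_net(x, training=training), (-1, num_classes, K, 3))
    return apply_tanh_flow(f, theta)

# Identity initialization, stage 1: learn input-independent parameters theta0 such
# that T(f; theta0) ~= f on a grid of plausible latent values (Adam, lr 0.05).
f_grid = tf.tile(tf.linspace(-3.0, 3.0, 200)[:, None], [1, num_classes])
theta0 = tf.Variable(0.01 * tf.random.normal([1, num_classes, K, 3]))
opt1 = tf.keras.optimizers.Adam(0.05)
for _ in range(500):
    with tf.GradientTape() as tape:
        warped = apply_tanh_flow(f_grid, tf.tile(theta0, [f_grid.shape[0], 1, 1, 1]))
        loss = tf.reduce_mean((warped - f_grid) ** 2)
    opt1.apply_gradients([(tape.gradient(loss, theta0), theta0)])

# Stage 2: train the NN to output theta0 at every training input, so the
# input-dependent flow also starts at the identity regardless of the random
# NN initialization.
X_train = np.random.randn(500, D).astype(np.float32)   # placeholder inputs
target = tf.tile(tf.reshape(theta0, [1, -1]), [X_train.shape[0], 1])
opt2 = tf.keras.optimizers.Adam(0.05)
for _ in range(500):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((param_net(X_train, training=True) - target) ** 2)
    grads = tape.gradient(loss, param_net.trainable_variables)
    opt2.apply_gradients(zip(grads, param_net.trainable_variables))
```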