Relative gradient optimization of the Jacobian term in unsupervised deep learning
Authors: Luigi Gresele, Giancarlo Fissore, Adrián Javaloy, Bernhard Schölkopf, Aapo Hyvärinen
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify empirically the computational speedup our method provides in section 5. |
| Researcher Affiliation | Academia | 1 Max Planck Institute for Intelligent Systems, Tübingen, Germany; 2 Max Planck Institute for Biological Cybernetics, Tübingen, Germany; 3 Université Paris-Saclay, Inria, Inria Saclay-Île-de-France, 91120, Palaiseau, France; 4 Université Paris-Saclay, CNRS, Laboratoire de recherche en informatique, 91405, Orsay, France; 5 Dept of Computer Science, University of Helsinki, Finland |
| Pseudocode | No | The paper describes procedures and mathematical derivations in text and equations, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | The code used for our experiments can be found at https://github.com/fissoreg/relative-gradient-jacobian. |
| Open Datasets | Yes | unconditional density estimation on four different UCI datasets [16] and a dataset of natural image patches (BSDS300) [41], as well as on MNIST [37]. |
| Dataset Splits | Yes | We trained for 100 epochs, and picked the best performing model on the validation set. |
| Hardware Specification | Yes | The main comparison is run on a Tesla P100 Nvidia GPU. |
| Software Dependencies | No | The paper mentions using the "JAX package [10]" for automatic differentiation in a comparison experiment, but does not provide specific version numbers for JAX or other software libraries/dependencies used for their own method's implementation. |
| Experiment Setup | Yes | The results in Table 1 correspond to networks with 3 fully connected hidden layers with 1024 units each, using a smooth version of leaky-ReLU activation functions. We performed an initial grid search on the learning rate in the range [10^-3, 10^-5], and used an Adam optimizer [38] with β1 = 0.9, β2 = 0.999. We trained for 100 epochs, and picked the best performing model on the validation set. We did not use any batch normalization, dropout, or learning rate scheduling. |
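
For readers wanting to reproduce the reported setup, below is a minimal JAX sketch of the configuration quoted in the Experiment Setup row: three fully connected hidden layers of 1024 units, a smooth leaky-ReLU activation, Adam with β1 = 0.9, β2 = 0.999, and a grid search over learning rates in the reported range. The exact smooth leaky-ReLU form, the use of `optax` as the optimizer library, and all function names here are assumptions for illustration only; the authors' actual implementation is in the repository linked above.

```python
# Hedged sketch of the reported training configuration (not the authors' code).
import jax
import jax.numpy as jnp
import optax  # assumed optimizer library; the paper does not name one

def smooth_leaky_relu(x, alpha=0.1):
    # Smooth surrogate of leaky-ReLU (assumed form): alpha*x + (1-alpha)*softplus(x).
    return alpha * x + (1.0 - alpha) * jax.nn.softplus(x)

def init_mlp(key, dim_in, hidden=1024, n_hidden=3):
    # Three fully connected hidden layers with 1024 units each, as quoted from Table 1.
    # Output dimension equals input dimension (the model maps data to a
    # same-dimensional latent space).
    sizes = [dim_in] + [hidden] * n_hidden + [dim_in]
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def forward(params, x):
    # Fully connected layers with the smooth activation on the hidden layers.
    for W, b in params[:-1]:
        x = smooth_leaky_relu(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def make_optimizer(lr):
    # Adam with the reported moment decay rates.
    return optax.adam(lr, b1=0.9, b2=0.999)

# Initial grid search over learning rates in the reported range [1e-5, 1e-3].
for lr in (1e-3, 1e-4, 1e-5):
    params = init_mlp(jax.random.PRNGKey(0), dim_in=784)  # e.g. flattened MNIST
    opt = make_optimizer(lr)
    opt_state = opt.init(params)
    # ... train for 100 epochs, track validation loss, keep the best model ...
```

As reported, no batch normalization, dropout, or learning-rate scheduling would be added on top of this configuration.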