A fast, universal algorithm to learn parametric nonlinear embeddings

Authors: Miguel Á. Carreira-Perpiñán, Max Vladymyrov

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments confirm that MAC finds optima as good as those of the conventional optimization based on chain-rule gradients, but that it is faster (particularly if using N-body methods). We demonstrate this with different embedding objectives (the elastic embedding and t-SNE) and mappings (linear and neural net). We report on a representative subset of experiments.
Researcher Affiliation | Collaboration | Miguel Á. Carreira-Perpiñán, EECS, University of California, Merced (http://eecs.ucmerced.edu); Max Vladymyrov, UC Merced and Yahoo Labs (maxv@yahoo-inc.com).
Pseudocode | No | The paper describes the algorithmic steps and equations (e.g., alternating optimization over F and Z) but does not contain a formally structured pseudocode block or algorithm listing. (An illustrative sketch of this alternation follows the table.)
Open Source Code | No | The paper mentions using and implementing various components (e.g., 'For the chain-rule PE optimization we used the code from [25]', 'we implemented our own chain-rule PE optimizer'), but it does not provide a statement or link releasing open-source code for the methodology described in this paper.
Open Datasets | Yes | Illustrative example: The simple example of fig. 1 shows the different embedding types described in the paper. We use the COIL-20 dataset, containing rotation sequences of 20 physical objects every 5 degrees, each a grayscale image of 128×128 pixels, for a total of N = 1,440 points in 16,384 dimensions; thus, each object traces a closed loop in pixel space. We produce 2D embeddings of 3 objects, using the elastic embedding (EE) [5]. ... Cost of the iterations: Fig. 2(left) shows, as a function of the number of data points N (using a 3D Swiss-roll dataset)... Different embedding objectives and mapping families: ... We apply each combination to learn a parametric embedding for the MNIST dataset, containing N = 60,000 images of handwritten digits.
Dataset Splits | No | The paper describes hyperparameter tuning and stopping criteria (e.g., 'stopping when the relative error is less than 10^-2') but does not explicitly specify train/validation/test dataset splits with percentages, absolute counts, or references to predefined validation splits.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions 'The runtimes were (excluding the time taken by pretraining, 40)'.
Software Dependencies | No | The paper mentions several methods and refers to code from other papers (e.g., 'For the chain-rule PE optimization we used the code from [25]', 'for the EE and t-SNE (free) embeddings we use the spectral direction [29]'), but it does not report ancillary software with version numbers (e.g., library or solver names such as Python 3.8 or CPLEX 12.4).
Experiment Setup | Yes | The optimization details are as follows. We preprocess the data using PCA, projecting to 15 dimensions... The free embedding was optimized using the spectral direction [29] until consecutive iterates differed by a relative error less than 10^-3. We increased µ from 0.003 to 0.015 with a step of 0.001 (12 µ values) and did 40 iterations for each µ value. The Z step uses the spectral direction, stopping when the relative error is less than 10^-2. ... We use t-SNE and a sigmoidal neural net with an architecture 3-100-500-2. ... For the neural net, we replicated the setup of [25]. This uses a neural net with an architecture (28×28)-500-500-2000-2, initialized with pretraining as described in [22] and [25]. For the chain-rule PE optimization we used the code from [25]. ... Each minibatch was trained with 3 CG iterations and a total of 30 epochs. For MAC, we used µ ∈ {10^-7, 5×10^-7, 10^-6, 5×10^-6, 10^-5, 5×10^-5}, optimizing until the objective function decrease (before the Z step and after the F step) was less than a relative error of 10^-3. ... The Z step (like the free embedding) uses the spectral direction with a fixed step size γ = 0.05, using 10 iterations of linear conjugate gradients to solve the linear system (L + (µ/2)I)P = G, and using warm-start... The gradient G of the free embedding is approximated in O(N log N) using the Barnes-Hut method with accuracy θ = 1.5. ... For the F step we used stochastic gradient descent with minibatches of 100 points, step size 10^-3 and momentum rate 0.9, and trained for 5 epochs for the first 3 values of µ and for 3 epochs for the rest. For the linear mapping F(y) = Ay, we implemented our own chain-rule PE optimizer with gradient descent and backtracking line search for 30 iterations. In MAC, we used 10 µ values spaced logarithmically from 10^-2 to 10^2, optimizing at each µ value until the objective function decrease was less than a relative error of 10^-4. Both the Z step and the free embedding use the spectral direction with a fixed step size γ = 0.01. We stop optimizing them when the relative error between consecutive embeddings is less than 10^-4. The gradient is approximated using fast multipole methods with accuracy p = 6 (the number of terms in the truncated series). (An illustrative sketch of the spectral-direction Z step also follows the table.)
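
The paper itself gives no pseudocode (see the Pseudocode row above). The following is a minimal sketch, in Python, of the alternation it describes: a Z step on the embedding objective with a quadratic penalty, followed by an F step that refits the mapping, over an increasing schedule of µ. The quadratic-penalty form E(Z) + (µ/2)·||Z − F(Y)||², the gradient-descent Z step, and the linear least-squares F step are illustrative assumptions, not code or settings from the paper (which uses the spectral direction and several mapping families).

```python
import numpy as np

def mac_parametric_embedding(Y, E_grad, mu_schedule, Z0, n_z_iters=40, step=1e-3):
    """Sketch (not the authors' code) of a MAC-style alternation:
    alternate a Z step (free-embedding objective plus quadratic penalty)
    with an F step (refit the mapping to the current coordinates),
    over an increasing schedule of mu.

    Y          -- N x D high-dimensional inputs
    E_grad(Z)  -- gradient of the free embedding objective E(Z) (placeholder)
    Z0         -- N x d initial embedding (e.g., from the free embedding)
    A linear mapping F(y) = A'y, fit by least squares, is used purely for illustration.
    """
    Z = Z0.copy()
    A = np.linalg.lstsq(Y, Z, rcond=None)[0]        # initial F step (least squares)
    for mu in mu_schedule:
        # Z step: plain gradient descent on E(Z) + (mu/2)*||Z - Y A||^2
        # (the paper uses the spectral direction instead)
        for _ in range(n_z_iters):
            Z -= step * (E_grad(Z) + mu * (Z - Y @ A))
        # F step: refit the mapping to the current auxiliary coordinates
        A = np.linalg.lstsq(Y, Z, rcond=None)[0]
    return A, Z
```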
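As an aid to reading the Z-step description in the Experiment Setup row, the sketch below shows one way to implement a preconditioned (spectral-direction-style) update in which the system (L + (µ/2)I)P = G is solved column by column with a few conjugate-gradient iterations, warm-started from the previous solution, followed by a fixed-step move. The function name, the placeholder grad_fn (the possibly N-body-approximated gradient of the penalized objective), the sparse Laplacian L, and the default values are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import cg

def spectral_direction_z_step(Z, grad_fn, L, mu, gamma=0.05, n_iters=40, cg_iters=10):
    """Illustrative sketch of a spectral-direction-style Z step.

    Z        -- N x d current embedding
    grad_fn  -- grad_fn(Z) returns the N x d gradient of the penalized objective
    L        -- N x N sparse graph Laplacian used as the preconditioner core
    """
    N, d = Z.shape
    B = L + 0.5 * mu * identity(N, format="csr")    # (L + mu/2 I), kept sparse
    P = np.zeros_like(Z)                            # warm-start storage for CG
    for _ in range(n_iters):
        G = grad_fn(Z)
        for k in range(d):                          # solve B p_k = g_k per column
            P[:, k], _ = cg(B, G[:, k], x0=P[:, k], maxiter=cg_iters)
        Z = Z - gamma * P                           # fixed-step move along the direction
    return Z
```

A µ schedule like the one quoted for the linear-mapping experiment (10 values spaced logarithmically from 10^-2 to 10^2) could be generated with np.logspace(-2, 2, 10).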