Alpha-Beta Divergences Discover Micro and Macro Structures in Data

Authors: Karthik Narayan, Ali Punjani, Pieter Abbeel

ICML 2015

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. We study this relationship, theoretically and through an empirical analysis over 10 datasets. Our work shows how the α and β parameters of the generalized alpha-beta divergence can be chosen to discover hidden macrostructures (categories, e.g., birds) or microstructures (fine-grained classes, e.g., toucans). Our method, which generalizes t-SNE (van der Maaten, 2008), allows us to discover such structure without extensive grid searches over (α, β) due to our theoretical analysis: such structure is apparent with particular choices of (α, β) that generalize across datasets. We also discuss efficient parallel CPU and GPU schemes which are non-trivial due to the tree structures employed in optimization and the large datasets that do not fully fit into GPU memory. Our method runs 20x faster than the fastest published code (Vladymyrov & Carreira-Perpiñán, 2014). We conclude with detailed case studies on the following very large datasets: ILSVRC 2012, a standard computer vision dataset with 1.2M images; SUSY, a particle physics dataset with 5M instances; and HIGGS, another particle physics dataset with 11M instances. This represents the largest published visualization attained by SNE methods.
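For context, the alpha-beta (AB) divergence family named in the abstract is usually written in the parameterization of Cichocki et al.; the paper reports its settings in terms of (α, λ), so treat the following as a reference form rather than the authors' exact notation:

    % Generalized AB divergence between nonnegative measures P = {p_ij}, Q = {q_ij},
    % valid when alpha, beta, and alpha + beta are all nonzero (other cases by continuity).
    \[
    D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) =
      -\frac{1}{\alpha\beta} \sum_{ij} \left(
          p_{ij}^{\alpha} q_{ij}^{\beta}
        - \frac{\alpha}{\alpha+\beta}\, p_{ij}^{\alpha+\beta}
        - \frac{\beta}{\alpha+\beta}\, q_{ij}^{\alpha+\beta}
      \right)
    \]

In the limit α → 1, β → 0 this recovers the generalized KL divergence Σ_ij [p_ij log(p_ij/q_ij) - p_ij + q_ij], which for normalized affinities is exactly the objective t-SNE minimizes; per the abstract, other (α, β) settings are what surface macro- versus microstructure.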
Researcher Affiliation: Academia. Karthik Narayan (KARTHIK.NARAYAN@BERKELEY.EDU), University of California, Berkeley, CA 94720, USA; Ali Punjani (ALIPUNJANI@CS.TORONTO.EDU), University of Toronto, ON M5S, Canada; Pieter Abbeel (PABBEEL@CS.BERKELEY.EDU), University of California, Berkeley, CA 94720, USA.
Pseudocode: No. The paper describes computational steps and kernels in detail in Section 5.1 (Parallel GPU Gradients), but these are presented in prose rather than as structured pseudocode or algorithm blocks.
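Since the GPU kernels are only described in prose, here is a minimal dense NumPy sketch (my own illustration, not the authors' code) of the exact t-SNE gradient that such kernels approximate; ABSNE's gradient presumably keeps the same attractive/repulsive structure with (α, β)-dependent weights, and the paper's contribution is evaluating the repulsive part with tree-based approximations on the GPU.

    import numpy as np

    def tsne_gradient_dense(Y, P):
        """Exact O(N^2) gradient of the standard t-SNE (KL) objective.

        Y : (N, 2) current embedding; P : (N, N) symmetric input affinities
        summing to 1. The tree-based GPU kernels described in Section 5.1
        approximate the repulsive portion of this sum; this dense version is
        only a reference for what those kernels compute.
        """
        diff = Y[:, None, :] - Y[None, :, :]            # pairwise differences, shape (N, N, 2)
        w = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # Student-t kernel
        np.fill_diagonal(w, 0.0)
        Q = w / w.sum()                                 # low-dimensional affinities
        # dC/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j):
        # the P term attracts neighbors, the Q term repels all points.
        return 4.0 * np.sum(((P - Q) * w)[:, :, None] * diff, axis=1)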
Open Source Code: Yes. We have open-sourced our visualization code: http://rll.berkeley.edu/absne/.
Open Datasets: Yes. MNIST (LeCun & Cortes, 1998), a dataset of 70K 28×28 grayscale images depicting handwritten digits 0-9, and CIFAR-10 (Krizhevsky & Hinton, 2009), a dataset of 32×32 color images depicting 10 distinct object classes. ... ILSVRC 2012 (Deng et al., 2009) features a training set of 1.28M images of varying size, a validation set of 50K images, and 1000 object categories. ... We return to the HIGGS dataset (Baldi et al., 2014)... We explore the SUSY dataset (Baldi et al., 2014)...
Dataset Splits: Yes. ILSVRC 2012 (Deng et al., 2009) features a training set of 1.28M images of varying size, a validation set of 50K images, and 1000 object categories. ... We visualize the test set to avoid having training labels directly influence the embedding. ... Applying PCA to center and reduce each dataset to 100 dimensions, we ran ABSNE under various (α, λ) for both datasets with perplexity 30 (Figure 3). We initialize all y_i in experiments with the same random seed and run exactly 1000 iterations of gradient descent per configuration... For HIGGS and SUSY, we first initialize and optimize with 0.5M random instances. Upon convergence, we place another 0.5M new random instances near their nearest neighbors in the original dataset, with isotropic Gaussian noise of variance 0.01.
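The HIGGS/SUSY warm-start protocol quoted above (optimize 0.5M seed instances, then drop each new instance at its nearest seed neighbor's embedding plus Gaussian noise) is straightforward to mirror; the helper below is a hypothetical sketch using scikit-learn's NearestNeighbors, not the authors' implementation.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def place_new_instances(X_seed, Y_seed, X_new, noise_var=0.01, seed=0):
        """Initialize embeddings for new instances near their nearest seed point.

        X_seed : (M, D) instances already embedded; Y_seed : (M, 2) their embedding.
        X_new  : (K, D) new instances to add before continuing optimization.
        Each new point starts at the embedding of its nearest neighbor in X_seed
        plus isotropic Gaussian noise with the quoted variance of 0.01.
        """
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=1).fit(X_seed)
        _, idx = nn.kneighbors(X_new)
        noise = rng.normal(scale=np.sqrt(noise_var), size=(len(X_new), Y_seed.shape[1]))
        return np.vstack([Y_seed, Y_seed[idx[:, 0]] + noise])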
Hardware Specification: Yes. All experiments in this paper employ a machine with 2x Intel Xeon X5570 CPUs (8 cores total, 2.93 GHz), 64GB memory, and an NVIDIA Tesla K40c graphics card.
Software Dependencies: No. For CIFAR-10, we train a 3-layer convolutional neural network with the CUDACONVNET (Krizhevsky, 2011) architecture using Caffe (Jia et al., 2014)... We perform the reduction through the open-source Thrust library (Bell & Hoberock, 2011). No explicit version numbers for Caffe, CUDACONVNET, or Thrust are provided.
Experiment Setup: Yes. We (1) initialize all y_i from a 2D isotropic Gaussian with variance 10^-4 and (2) update each y_i via gradient descent (GD) with momentum (step size 200). For the first 250 descent iterations, we use momentum 0.5 and multiply all P_ij values by a user-defined constant α = 12. For the last 750 iterations, we use momentum 0.8. We use a per-parameter adaptive learning rate scheme to speed up GD convergence (Jacobs, 1988)... We initialize all y_i in experiments with the same random seed and run exactly 1000 iterations of gradient descent per configuration... with perplexity 30.
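Putting the quoted schedule together, a minimal NumPy sketch of the descent loop might look as follows (my own reconstruction: the momentum switch, early exaggeration, step size, and iteration count are taken from the excerpt, while the Jacobs-style gain update follows the common t-SNE convention and may differ from the authors' exact scheme):

    import numpy as np

    def optimize_embedding(P, grad_fn, n_iter=1000, eta=200.0, seed=0):
        """Gradient descent with momentum, early exaggeration, and adaptive rates.

        P       : (N, N) input affinities
        grad_fn : callable(Y, P) returning the gradient of the chosen divergence
        """
        rng = np.random.default_rng(seed)
        N = P.shape[0]
        Y = rng.normal(scale=np.sqrt(1e-4), size=(N, 2))  # 2D Gaussian init, variance 1e-4
        velocity = np.zeros_like(Y)
        gains = np.ones_like(Y)  # per-parameter adaptive rates (Jacobs, 1988)
        for it in range(n_iter):
            exaggeration = 12.0 if it < 250 else 1.0      # multiply P_ij by 12 early on
            momentum = 0.5 if it < 250 else 0.8           # 0.5 for 250 iterations, then 0.8
            grad = grad_fn(Y, exaggeration * P)
            # Grow the rate while descent keeps moving a coordinate the same way
            # (gradient opposite to velocity); shrink it when the sign flips.
            gains = np.where(np.sign(grad) != np.sign(velocity), gains + 0.2, gains * 0.8)
            gains = np.maximum(gains, 0.01)
            velocity = momentum * velocity - eta * gains * grad
            Y += velocity
        return Y

Pairing this loop with the dense gradient sketched earlier roughly reproduces plain t-SNE on small inputs; the paper's contribution is making the gradient evaluation fast enough for millions of points.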