Canopy Fast Sampling with Cover Trees

Authors: Manzil Zaheer, Satwik Kottur, Amr Ahmed, José Moura, Alex Smola

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this work, we propose Canopy, a sampler based on Cover Trees that is exact, has guaranteed runtime logarithmic in the number of atoms, and is provably polynomial in the inherent dimensionality of the underlying parameter space. In other words, the algorithm is as fast as search over a hierarchical data structure. We provide theory for Canopy and demonstrate its effectiveness on both synthetic and real datasets, consisting of over 100 million images. We now present empirical studies for our fast sampling techniques in order to establish that (i) Canopy is fast (Sec. 5.1), (ii) Canopy is accurate (Sec. 5.2), and (iii) it opens new avenues for data exploration and unsupervised learning (Sec. 5.3), previously unthinkable.
Researcher Affiliation Collaboration 1Carnegie Mellon University, USA 2Amazon Web Services, USA 3Google Inc, USA. Correspondence to: Manzil Zaheer <manzil@cmu.edu>.
Pseudocode No The paper describes algorithms and procedures in text, but it does not include formally structured pseudocode or algorithm blocks.
Open Source Code No The paper refers to 'publicly available pre-trained model of 200 layers2' with a GitHub link (github.com/facebook/fb.resnet.torch), but this is for a third-party model used, not the source code for the Canopy method described in the paper. There is no statement indicating that the authors' own code for Canopy is open-sourced.
Open Datasets Yes Datasets We use two benchmark image datasets MNIST8m (Loosli et al., 2007) and CIFAR-100 (Krizhevsky & Hinton, 2009).
Dataset Splits No The paper mentions using a 'TRAIN' set to learn parameters and a 'TEST' set for evaluation, but it does not explicitly specify a separate validation set or provide details about validation splits.
Hardware Specification Yes We run our experiments on a cluster of 16 Amazon EC2 c4.8xlarge nodes connected through 10Gb/s Ethernet. There are 36 virtual threads per node and 60GB of memory.
Software Dependencies No The paper states: 'All the algorithms are implemented multithreaded in simple C++11'. While C++11 provides a version, it does not list any specific libraries, frameworks, or solvers with their version numbers beyond the programming language itself, which is insufficient for full reproducibility.
Experiment Setup Yes All our experiments use cover tree based initializations.