Canopy Fast Sampling with Cover Trees
Authors: Manzil Zaheer, Satwik Kottur, Amr Ahmed, José Moura, Alex Smola
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose Canopy, a sampler based on Cover Trees that is exact, has guaranteed runtime logarithmic in the number of atoms, and is provably polynomial in the inherent dimensionality of the underlying parameter space. In other words, the algorithm is as fast as search over a hierarchical data structure. We provide theory for Canopy and demonstrate its effectiveness on both synthetic and real datasets, consisting of over 100 million images. We now present empirical studies for our fast sampling techniques in order to establish that (i) Canopy is fast (Sec. 5.1), (ii) Canopy is accurate (Sec. 5.2), and (iii) it opens new avenues for data exploration and unsupervised learning (Sec. 5.3), previously unthinkable. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University, USA 2Amazon Web Services, USA 3Google Inc, USA. Correspondence to: Manzil Zaheer <manzil@cmu.edu>. |
| Pseudocode | No | The paper describes algorithms and procedures in text, but it does not include formally structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to 'publicly available pre-trained model of 200 layers2' with a GitHub link (github.com/facebook/fb.resnet.torch), but this is for a third-party model used, not the source code for the Canopy method described in the paper. There is no statement indicating that the authors' own code for Canopy is open-sourced. |
| Open Datasets | Yes | Datasets We use two benchmark image datasets MNIST8m (Loosli et al., 2007) and CIFAR-100 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper mentions using a 'TRAIN' set to learn parameters and a 'TEST' set for evaluation, but it does not explicitly specify a separate validation set or provide details about validation splits. |
| Hardware Specification | Yes | We run our experiments on a cluster of 16 Amazon EC2 c4.8xlarge nodes connected through 10Gb/s Ethernet. There are 36 virtual threads per node and 60GB of memory. |
| Software Dependencies | No | The paper states: 'All the algorithms are implemented multithreaded in simple C++11'. While C++11 provides a version, it does not list any specific libraries, frameworks, or solvers with their version numbers beyond the programming language itself, which is insufficient for full reproducibility. |
| Experiment Setup | Yes | All our experiments use cover tree based initializations. |