Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Linear-Time Learning on Distributions with Approximate Kernel Embeddings
Authors: Danica Sutherland, Junier Oliva, Barnabás Póczos, Jeff Schneider
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an analysis of the approximation error in using our proposed random features, and show empirically the quality of our approximation both in estimating a Gram matrix and in solving learning tasks in real-world and synthetic data. |
| Researcher Affiliation | Academia | Dougal J. Sutherland and Junier B. Oliva and Barnabás Póczos and Jeff Schneider Carnegie Mellon University EMAIL |
| Pseudocode | Yes | The algorithm for computing features {z(Â(p_i))}_{i=1}^N for a set of distributions {p_i}_{i=1}^N, given sample sets {χ_i}_{i=1}^N where χ_i = {X_j^(i) ∈ [0, 1]^ℓ}_{j=1}^{n_i} iid ∼ p_i, is thus: 1. Draw M scalars λ_j iid ∼ μ and D/2 vectors ω_r iid ∼ N(0, σ^{-2} I_{2M\|V\|}), in O(M\|V\|D) time. 2. For each of the N input distributions i: (a) Compute a kernel density estimate from χ_i, p̂_i(u_j) for each u_j in (10), in O(n_i n_e) time. (b) Compute Â(p̂_i) using a numerical integration estimate as in (10), in O(M\|V\|n_e) time. (c) Get the RKS features, z(Â(p̂_i)), in O(M\|V\|D) time. |
| Open Source Code | No | The paper mentions a GitHub link in footnote 3: 'github.com/dougalsutherland/skl-groups/', but explicitly states it's for the KL kernel ('as did the KL kernel3'), not the authors' main contribution (HDD embeddings). The paper also states: 'while the HDD embeddings used a simple Matlab implementation.', indicating their own code is not openly provided. |
| Open Datasets | Yes | We took the cat and dog classes from the CIFAR-10 dataset (Krizhevsky and Hinton 2009). We consider the Scene-15 dataset (Lazebnik, Schmid, and Ponce 2006). |
| Dataset Splits | Yes | Throughout these experiments we use M = 5, \|V\| = 10ℓ (selected as rules of thumb; larger values did not improve performance), and use a validation set (10% of the training set) to choose bandwidths for KDE and the RBF kernel as well as model regularization parameters. |
| Hardware Specification | No | No specific hardware details (like GPU or CPU models, or memory specifications) were provided for the experiments. |
| Software Dependencies | No | The paper mentions 'a simple Matlab implementation' for HDD embeddings and that SVM classifiers were used 'from LIBLINEAR (Fan et al. 2008, for the embeddings) or LIBSVM (Chang and Lin 2011, for the KL kernel)', but no specific version numbers for Matlab, LIBLINEAR, or LIBSVM are provided. |
| Experiment Setup | Yes | Throughout these experiments we use M = 5, \|V\| = 10ℓ (selected as rules of thumb; larger values did not improve performance), and use a validation set (10% of the training set) to choose bandwidths for KDE and the RBF kernel as well as model regularization parameters. Except in the scene classification experiments, the histogram methods used 10 bins per dimension; performance with other values was not better. The KL estimator used the fourth nearest neighbor. ...we use D = 5,000. ...with D = 7,000. |
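The pseudocode quoted above (a kernel density estimate on a grid, a numerical-integration estimate of the embedding Â, then random kitchen sink features) can be sketched as follows. This is a hedged illustration, not the authors' Matlab implementation: the 1-D Gaussian KDE, the crude grid-average stand-in for the projection step Â(·), and the helper names (`kde_at`, `rks_features`, `embed_distributions`) are all assumptions made for the sake of a short runnable example.

```python
import math
import random

def kde_at(grid, samples, bw):
    # Gaussian KDE p-hat evaluated at each grid point (1-D for simplicity).
    norm = 1.0 / (len(samples) * bw * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((u - x) / bw) ** 2) for x in samples)
            for u in grid]

def rks_features(vec, omegas):
    # Random kitchen sink features: sqrt(2/D) * [cos(w.a), sin(w.a)] per omega.
    feats = []
    for w in omegas:
        dot = sum(wi * ai for wi, ai in zip(w, vec))
        feats.extend([math.cos(dot), math.sin(dot)])
    c = math.sqrt(2.0 / (2 * len(omegas)))
    return [c * f for f in feats]

def embed_distributions(sample_sets, grid, bw, D, sigma, rng):
    # Step 1: draw D/2 Gaussian frequency vectors (one per cos/sin pair).
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in grid] for _ in range(D // 2)]
    out = []
    for samples in sample_sets:  # Step 2, per input distribution:
        p_hat = kde_at(grid, samples, bw)          # (a) KDE on the grid
        a_hat = [v / len(grid) for v in p_hat]     # (b) grid-average stand-in for A-hat
        out.append(rks_features(a_hat, omegas))    # (c) RKS features
    return out
```

With this sketch, N distributions each become a fixed-length feature vector of dimension D, so a linear model (e.g. LIBLINEAR, as in the paper) can then be trained in time linear in N.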