Dimensionality Reduction for Representing the Knowledge of Probabilistic Models
Authors: Marc T. Law, Jake Snell, Amir-massoud Farahmand, Raquel Urtasun, Richard S. Zemel
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that our framework improves generalization performance to unseen categories in zero-shot learning. We evaluate the relevance of our method in two types of experiments. The first learns low-dimensional representations for visualization to better interpret pre-trained deep models. The second experiment exploits the probability scores generated by a pre-trained classifier in the zero-shot learning context; these probability scores are used as supervision to improve performance on novel categories. |
| Researcher Affiliation | Collaboration | Marc T. Law and Jake Snell: University of Toronto, Canada; Vector Institute, Canada. Amir-massoud Farahmand: Vector Institute, Canada. Raquel Urtasun: University of Toronto, Canada; Vector Institute, Canada; Uber ATG, Canada. Richard S. Zemel: University of Toronto, Canada; Vector Institute, Canada; CIFAR Senior Fellow. |
| Pseudocode | Yes | Algorithm 1: Dimensionality Reduction of Probabilistic Representations (DRPR). Input: set of training examples (e.g., images) in X and their target probability scores (e.g., classification scores w.r.t. k training categories), nonlinear mapping g_θ parameterized by θ, number of iterations t. 1: for iteration 1 to t do; 2: randomly sample n training examples x_1, ..., x_n ∈ X and create the target assignment matrix Y ∈ Y^{n×k} containing the target probability scores y_1, ..., y_n (i.e., Y = [y_1, ..., y_n]^⊤ ∈ Y^{n×k}); 3: create the matrix F ← [f_1, ..., f_n]^⊤ ∈ V^n such that ∀i, f_i = g_θ(x_i); 4: create the matrix of centers M ← diag(Y^⊤ 1_n)^{-1} Y^⊤ F and the prior vector π ← (1/n) Y^⊤ 1_n; 5: update the parameters θ by performing a gradient descent iteration on ℓ_n(ψ(F, M, π), Y) (i.e., Eq. (4)); 6: end for. Output: nonlinear mapping g_θ. (A PyTorch sketch of this loop is given after the table.) |
| Open Source Code | No | The paper does not provide a direct link or explicit statement that the source code for the proposed DRPR method is publicly available. It only mentions pre-trained models used from another source. |
| Open Datasets | Yes | We evaluate our approach on the test sets of the MNIST (LeCun et al., 1998), STL (Coates et al., 2011), CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009) datasets with pre-trained models that are publicly available and optimized for cross entropy. We use the medium-scale Caltech-UCSD Birds (CUB) dataset (Welinder et al., 2010) and Oxford Flowers-102 (Flowers) dataset (Nilsback & Zisserman, 2008). |
| Dataset Splits | Yes | CUB contains 11,788 bird images from 200 different species categories split into disjoint sets: 100 categories for training, 50 for validation and 50 for test. Flowers contains 8,189 flower images from 102 different species categories: 62 categories are used for training, 20 for validation and 20 for test. |
| Hardware Specification | Yes | We coded our method in PyTorch and ran all our experiments on a single Nvidia GeForce GTX 1060, which has 6 GB of RAM. |
| Software Dependencies | No | The paper states 'We coded our method in PyTorch' but does not specify a PyTorch version or list any other software dependencies with version numbers. |
| Experiment Setup | Yes | Mini-batch size: the training sets of CUB and Flowers contain 5,894 and 5,878 images, respectively. In order to fit into memory, we set our mini-batch sizes to 421 (= 5894/14) and 735 (≈ 5878/8) for CUB and Flowers, respectively. Optimizer: we use the Adam optimizer with a learning rate of 10⁻⁵ to train both models ϕ_{θ1} and g_{θ2}. Initial temperature of our model: to make our optimization framework stable, we start with a temperature of 50. We then formulate our Bregman divergence as d(f_i, µ_c) = (1/temp) ‖f_i − µ_c‖₂², where f_i and µ_c are the representations learned by our model. We decrease the temperature by 10% (i.e., temp_{t+1} = 0.9 · temp_t) every 3000 epochs until the algorithm stops training. (A sketch of this schedule follows the table.) |
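
The pseudocode cell above compresses the paper's Algorithm 1; the following is a minimal PyTorch sketch of one DRPR iteration, not the authors' code. It assumes a temperature-scaled squared Euclidean distance as the Bregman divergence and a cross-entropy between target and predicted soft assignments as ℓ_n; the names `drpr_step`, `g_theta`, and `temp` are illustrative.

```python
import torch
import torch.nn.functional as F

def drpr_step(g_theta, optimizer, x, y, temp=50.0):
    """One DRPR iteration (steps 2-5 of Algorithm 1) -- a sketch, not the authors' code.

    x: mini-batch of n training examples; y: (n, k) target probability scores over the
    k training categories. Assumes a temperature-scaled squared Euclidean distance as
    the Bregman divergence and a soft-target cross-entropy as the loss l_n.
    """
    n = y.shape[0]
    feats = g_theta(x)                                # step 3: F = [f_1, ..., f_n]^T, shape (n, d)

    # Step 4: centers M = diag(Y^T 1_n)^{-1} Y^T F and prior pi = (1/n) Y^T 1_n.
    mass = y.sum(dim=0).clamp_min(1e-12)              # Y^T 1_n, shape (k,); clamp guards empty classes
    centers = (y.t() @ feats) / mass.unsqueeze(1)     # shape (k, d)
    pi = mass / n                                     # shape (k,)

    # psi(F, M, pi): posterior over centers from the prior and the scaled divergence.
    dist = torch.cdist(feats, centers).pow(2) / temp  # d(f_i, mu_c) = ||f_i - mu_c||_2^2 / temp
    log_psi = F.log_softmax(torch.log(pi) - dist, dim=1)

    # Step 5: one gradient descent iteration on l_n(psi(F, M, pi), Y).
    loss = -(y * log_psi).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```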
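
The temperature schedule from the experiment-setup cell (start at 50, multiply by 0.9 every 3000 epochs) could wrap the step above roughly as follows. `train_loader` and `n_epochs` are hypothetical placeholders; the optimizer would be Adam with a learning rate of 10⁻⁵, as reported in the paper.

```python
def train_drpr(g_theta, optimizer, train_loader, n_epochs):
    """Temperature-annealed training loop sketched from the reported setup (not the authors' code)."""
    temp = 50.0                              # initial temperature reported in the paper
    for epoch in range(n_epochs):
        if epoch > 0 and epoch % 3000 == 0:
            temp *= 0.9                      # temp_{t+1} = 0.9 * temp_t, every 3000 epochs
        for x_batch, y_batch in train_loader:
            drpr_step(g_theta, optimizer, x_batch, y_batch, temp=temp)
    return g_theta
```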