Analyzing and Improving Representations with the Soft Nearest Neighbor Loss
Authors: Nicholas Frosst, Nicolas Papernot, Geoffrey Hinton
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate several use cases of the loss. As an analytical tool, it provides insights into the evolution of class similarity structures during learning. Surprisingly, we find that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination in the final layer, possibly because it encourages representations to identify class-independent similarity structures. Maximizing the soft nearest neighbor loss in the hidden layers leads not only to better-calibrated estimates of uncertainty on outlier data but also to marginally improved generalization. Data that is not from the training distribution can be recognized by observing that in the hidden layers, it has fewer than the normal number of neighbors from the predicted class. We trained a convolutional network on MNIST, Fashion-MNIST and SVHN, as well as a ResNet on CIFAR10. Two variants of each model were trained with a different objective: (1) a baseline with cross-entropy only and (2) an entangled variant balancing both cross-entropy and the soft nearest neighbor loss as per Equation 3. As reported in Table 1, all entangled models marginally outperformed their non-entangled counterparts. |
| Researcher Affiliation | Industry | Nicholas Frosst, Nicolas Papernot, Geoffrey Hinton (Google Brain). Correspondence to: N. Frosst <frosst@google.com>, N. Papernot <papernot@google.com>. |
| Pseudocode | No | The paper provides mathematical definitions and descriptions but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | We open-sourced TensorFlow code outlining the matrix operations needed to compute this loss efficiently. (A minimal sketch of this computation appears after the table.) |
| Open Datasets | Yes | We trained a convolutional network on MNIST, Fashion-MNIST and SVHN, as well as a ResNet on CIFAR10. |
| Dataset Splits | Yes | Figure 7. Test accuracy as a function of the soft nearest neighbor hyper-parameter α for 64 training runs of a ResNet on CIFAR10. Each run is selected by Vizier (Golovin et al., 2017) to maximize validation accuracy by tuning the learning rate, SNNL hyper-parameter α, and temperature T. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions 'TensorFlow code' but does not specify any version numbers for TensorFlow or other software dependencies. |
| Experiment Setup | Yes | The architecture we used was made up of two convolutional layers followed by three fully connected layers and a final softmax layer. The network was trained with Adam at a learning rate of 1e-4 and a batch size of 256 for 14,000 steps. The ResNet v2 with 15 layers was trained for 106 epochs with an exponentially decreasing learning rate starting at 0.4. (A hedged sketch of the corresponding training objective follows the table.) |
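
The paper states that TensorFlow code outlining the matrix operations for the loss was open-sourced. The sketch below is not that release; it is a minimal TensorFlow reconstruction of the loss as described in the paper (pairwise squared distances scaled by a temperature T, self-pairs excluded, numerator restricted to same-class pairs), with the function name and the `eps` stabilizer chosen here for illustration.

```python
import tensorflow as tf


def soft_nearest_neighbor_loss(features, labels, temperature=1.0):
    """Batched soft nearest neighbor loss (sketch, not the authors' release).

    features: [batch, ...] hidden activations; flattened to [batch, dim].
    labels:   [batch] integer class labels.
    """
    batch = tf.shape(features)[0]
    features = tf.cast(tf.reshape(features, [batch, -1]), tf.float32)

    # Pairwise squared Euclidean distances via ||a||^2 - 2ab + ||b||^2.
    sq_norms = tf.reduce_sum(tf.square(features), axis=1, keepdims=True)
    distances = (sq_norms
                 - 2.0 * tf.matmul(features, features, transpose_b=True)
                 + tf.transpose(sq_norms))

    # Kernel similarities at temperature T, with self-pairs (i == j) zeroed out.
    similarities = tf.exp(-distances / temperature)
    similarities *= 1.0 - tf.eye(batch, dtype=tf.float32)

    # Numerator keeps only pairs sharing a label; denominator keeps all pairs.
    same_class = tf.cast(tf.equal(labels[:, None], labels[None, :]), tf.float32)
    numerator = tf.reduce_sum(similarities * same_class, axis=1)
    denominator = tf.reduce_sum(similarities, axis=1)

    eps = 1e-8  # assumed stabilizer: guards against empty same-class neighborhoods
    return -tf.reduce_mean(tf.math.log(eps + numerator / (eps + denominator)))
```

Minimizing this quantity pulls same-class points together; the entangled models instead weight it so that entanglement in the hidden layers is maximized, which the composite objective sketched next makes explicit.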
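
The entangled variants balance cross-entropy against the SNNL summed over hidden layers, as per the paper's Equation 3, with the balance controlled by the hyper-parameter α tuned alongside the temperature T. The snippet below is a hedged illustration of that balance rather than the paper's exact formulation: the sign convention for α is an assumption, and `hidden_layers` is a placeholder for whichever intermediate activations the model exposes.

```python
def composite_loss(logits, hidden_layers, labels, alpha, temperature):
    """Cross-entropy plus an alpha-weighted SNNL term over hidden layers.

    Sign convention is assumed here: written this way, a negative alpha
    rewards maximizing entanglement in the hidden layers, the behaviour
    the paper reports as beneficial.
    """
    xent = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                       logits=logits))
    snnl = tf.add_n([
        soft_nearest_neighbor_loss(h, labels, temperature)
        for h in hidden_layers
    ])
    return xent + alpha * snnl
```

A training loop matching the reported setup would minimize this composite loss with Adam at a learning rate of 1e-4 and a batch size of 256; per the Figure 7 quote above, the learning rate, α, and T were tuned jointly with Vizier for the CIFAR10 ResNet.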