Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Intrinsic dimension of data representations in deep neural networks

Authors: Alessio Ansuini, Alessandro Laio, Jakob H. Macke, Davide Zoccolan

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Here we study the intrinsic dimensionality (ID) of data representations, i.e. the minimal number of parameters needed to describe a representation. We find that, in a trained network, the ID is orders of magnitude smaller than the number of units in each layer. Across layers, the ID first increases and then progressively decreases in the final layers. Remarkably, the ID of the last hidden layer predicts classification accuracy on the test set. These results can neither be found by linear dimensionality estimates (e.g., with principal component analysis), nor in representations that had been artificially linearized. They are neither found in untrained networks, nor in networks that are trained on randomized labels. This suggests that neural networks that can generalize are those that transform the data into low-dimensional, but not necessarily flat manifolds."
Researcher Affiliation | Academia | Alessio Ansuini, International School for Advanced Studies (EMAIL); Alessandro Laio, International School for Advanced Studies (EMAIL); Jakob H. Macke, Technical University of Munich (EMAIL); Davide Zoccolan, International School for Advanced Studies (EMAIL)
Pseudocode | Yes | "Figure 1: The TwoNN estimator derives an estimate of intrinsic dimensionality from the statistics of nearest-neighbour distances. 1) For each data point i compute the distance to its first and second neighbour (r_i,1 and r_i,2). 2) For each i compute μ_i = r_i,2 / r_i,1. ... 3) Infer d from the empirical probability distribution of all the μ_i. 4) Repeat the calculation selecting a fraction of points at random. This gives the ID as a function of the scale."
Open Source Code | Yes | "The code to compute the ID estimates with the TwoNN method and to reproduce our experiments is available at this repository."
Open Datasets | Yes | "We first investigated the variation of the ID across the layers of a VGG-16 network (20), pre-trained on ImageNet (11), and fine-tuned and evaluated on a synthetic dataset of 1440 images (21). ...computed the average ID of the object manifolds corresponding to the 7 biggest ImageNet categories, using 500 images per category... we generated a modified MNIST dataset (referred to as MNIST )... (38) Y. LeCun and C. Cortes, MNIST handwritten digit database, 2010."
Dataset Splits | No | The paper mentions leaving out a 'test set' for the synthetic dataset and discusses performance 'without estimating the performance on an external validation set', indicating that explicit training/validation splits are not provided or used in the conventional sense for hyperparameter tuning.
Hardware Specification | No | The paper mentions performing calculations 'in a few seconds on a desktop PC' but does not provide specific hardware details such as CPU or GPU models, or memory.
Software Dependencies | No | The paper cites the PyTorch framework (37) but does not state specific version numbers for any software libraries or dependencies used in the experiments.
Experiment Setup | Yes | "We extracted representations at pooling layers after a convolution or a block of consecutive convolutions, and at fully connected layers. In the experiments with ResNets, we extracted the representations after each ResNet block (19) and the average pooling before the output. ... we generated a modified MNIST dataset (referred to as MNIST ) by adding a luminance perturbation... with λ = 100..."
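The TwoNN procedure quoted in the Pseudocode row (ratios of first- and second-neighbour distances, then inference of d from their empirical distribution) can be sketched as follows. This is not the authors' released implementation; it is a minimal illustration assuming the standard TwoNN fit, in which the empirical CDF F of the sorted ratios μ satisfies -log(1 - F) ≈ d · log μ, so d is recovered by a linear fit through the origin. The function name `twonn_id` and the `discard_fraction` parameter are choices made here for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X, discard_fraction=0.1):
    """Estimate the intrinsic dimension of points X (n_samples, n_features)
    via the TwoNN ratio of second- to first-nearest-neighbour distances."""
    # k=3 because the query point itself is returned as its own 0-distance neighbour
    dists, _ = cKDTree(X).query(X, k=3)
    mu = dists[:, 2] / dists[:, 1]          # mu_i = r_{i,2} / r_{i,1}
    mu = np.sort(mu)
    n = len(mu)
    # Drop the largest ratios, which are the most sensitive to noise and outliers
    keep = int(n * (1.0 - discard_fraction))
    mu = mu[:keep]
    F = np.arange(1, keep + 1) / n          # empirical CDF of the ratios
    # Linear fit through the origin: -log(1 - F) = d * log(mu)
    x = np.log(mu)
    y = -np.log(1.0 - F)
    return float(np.dot(x, y) / np.dot(x, x))
```

Applied to points drawn from a 2-dimensional manifold embedded in a higher-dimensional space, an estimator of this kind should return a value close to 2 regardless of the embedding dimension, which is the property the paper exploits when measuring the ID of layer representations.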