Minimalistic Unsupervised Representation Learning with the Sparse Manifold Transform

Authors: Yubei Chen, Zeyu Yun, Yi Ma, Bruno Olshausen, Yann LeCun

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | With a one-layer deterministic (one training epoch) sparse manifold transform, it is possible to achieve 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10, and 53.2% on CIFAR-100. With simple grayscale augmentation, the model achieves 83.2% KNN top-1 accuracy on CIFAR-10 and 57% on CIFAR-100. These results significantly close the gap between simplistic white-box methods and SOTA methods. We also provide visualization to illustrate how an unsupervised representation transform is formed. (A stand-in KNN top-1 probe is sketched below the table.)
Researcher Affiliation | Collaboration | Yubei Chen (1,2), Zeyu Yun (4,5), Yi Ma (4), Bruno Olshausen (4,5,6), Yann LeCun (1,2,3); 1 Meta AI; 2 Center for Data Science and 3 Courant Institute, New York University; 4 EECS Dept., 5 Redwood Center, and 6 Helen Wills Neuroscience Inst., UC Berkeley
Pseudocode | Yes | Algorithm 1: Online 1-sparse dictionary learning heuristic (K-means clustering); a minimal sketch of such a loop is given below the table.
Open Source Code | No | The paper does not contain an explicit statement of an open-source code release or a link to a repository for the described methodology.
Open Datasets | Yes | achieves 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10, and 53.2% on CIFAR-100 without data augmentation. ... We use the WikiText-103 corpus [76], which contains around 103 million tokens, to train SMT word embeddings.
Dataset Splits | No | The paper mentions using standard datasets such as MNIST, CIFAR-10, and CIFAR-100 for evaluation and refers to "training examples" in ablation studies, but it does not explicitly state the training, validation, and test splits (e.g., percentages or sample counts) or cite a specific source for these splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU models, or cloud computing specifications.
Software Dependencies | No | The paper mentions using the "Basic English Tokenizer from torchtext", the "transform module in torchvision", and solo-learn for pretraining, but it does not specify version numbers for these software dependencies or other key components. (Typical usage of the named components is sketched below the table.)
Experiment Setup | Yes | For both MNIST and CIFAR, we use 6x6 image patches, i.e., T = 6. Once patch embeddings are computed, we use a spatial average pooling layer with ks = 4 and stride = 2 to aggregate patch embeddings and compute the final image-level embeddings (a sketch of this aggregation is given below the table). ... For all the benchmark results on the deep SSL models, we follow the pretraining procedure used in solo-learn; each method is pretrained for 1000 epochs. ... For both methods, we use ResNet-18 as the backbone, a batch size of 256, and the LARS optimizer, with 10 epochs of warmup followed by a cosine decay. For the SimCLR model, we set the learning rate to 0.4 and the weight decay to 1e-5. We also set the temperature to 0.2, the dimension of the projection output layer to 256, and the dimension of the projection hidden layer to 2048. For VICReg, we set the learning rate to 0.3 and the weight decay to 1e-4. We also set the temperature to 0.2, and the dimensions of the projection output and hidden layers to 2048. We set the weights for the similarity loss and variance loss to 25, and the weight for the covariance loss to 1.
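
The Algorithm 1 referenced in the Pseudocode row is an online 1-sparse dictionary learning heuristic based on K-means clustering. Below is a minimal sketch of such a loop, assuming whitened, roughly unit-norm patches as input; the dictionary size, learning rate, single-pass schedule, and renormalization step are illustrative assumptions, not the authors' exact heuristic.

```python
import numpy as np

def online_1sparse_dictionary(patches, num_atoms=8192, lr=0.05, seed=0):
    """patches: (N, D) array of whitened image patches, roughly unit-norm."""
    rng = np.random.default_rng(seed)
    # Initialize atoms from randomly chosen patches and normalize them.
    dictionary = patches[rng.choice(len(patches), num_atoms, replace=False)].astype(float)
    dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True) + 1e-8

    for x in patches:
        # 1-sparse inference: assign the patch to the atom with the largest inner product.
        k = int(np.argmax(dictionary @ x))
        # Online K-means-style update: move the winning atom toward the patch,
        # then renormalize so every atom stays on the unit sphere.
        dictionary[k] += lr * (x - dictionary[k])
        dictionary[k] /= np.linalg.norm(dictionary[k]) + 1e-8
    return dictionary
```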
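
The Experiment Setup row describes aggregating 6x6 patch embeddings with a spatial average pooling layer (ks = 4, stride = 2). The sketch below shows one way this could look in a PyTorch pipeline; the unfold-based dense patch extraction, the embed_patches callable, and the tensor shapes are assumptions rather than code from the paper.

```python
import torch
import torch.nn.functional as F

def aggregate_patch_embeddings(images, embed_patches, patch_size=6):
    """images: (B, C, H, W); embed_patches: maps (N, C*T*T) -> (N, d)."""
    B, C, H, W = images.shape
    # Extract every overlapping T x T patch (stride 1), one per spatial location.
    patches = F.unfold(images, kernel_size=patch_size)            # (B, C*T*T, L)
    gh, gw = H - patch_size + 1, W - patch_size + 1                # patch-location grid
    emb = embed_patches(patches.transpose(1, 2).reshape(-1, C * patch_size**2))
    emb = emb.reshape(B, gh, gw, -1).permute(0, 3, 1, 2)           # (B, d, gh, gw)
    # Spatial average pooling with ks = 4 and stride = 2, as stated in the setup.
    pooled = F.avg_pool2d(emb, kernel_size=4, stride=2)
    return pooled.flatten(1)                                       # image-level embedding
```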
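
The KNN top-1 accuracies quoted in the Research Type row are typically obtained with a nearest-neighbor probe on frozen embeddings. The following is a stand-in, assuming L2-normalized train/test embeddings and using scikit-learn; the authors' exact k, distance metric, and neighbor weighting are not restated in this table, so treat the settings below as placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_top1_accuracy(train_emb, train_labels, test_emb, test_labels, k=20):
    # On L2-normalized embeddings, Euclidean distance is a monotone function of
    # cosine similarity, so a plain Euclidean KNN classifier is a reasonable proxy.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_emb, train_labels)
    predictions = knn.predict(test_emb)
    return float(np.mean(predictions == test_labels))
```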
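
For the Software Dependencies row, the snippet below shows typical usage of the named components: the basic English tokenizer from torchtext and the torchvision transform module, here with a grayscale augmentation as mentioned in the Research Type row. Version numbers are not given in the paper, and the parameter values are illustrative.

```python
from torchtext.data.utils import get_tokenizer
from torchvision import transforms

# Basic English tokenizer from torchtext, as cited for the WikiText-103 word-embedding experiments.
tokenizer = get_tokenizer("basic_english")
tokens = tokenizer("the sparse manifold transform builds word embeddings")

# torchvision transforms, e.g. for a simple grayscale augmentation;
# the probability value is an assumption.
augment = transforms.Compose([
    transforms.RandomGrayscale(p=0.5),
    transforms.ToTensor(),
])
```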