Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

Authors: Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate that, due to their expressiveness, learnability, and efficiency, we can use K-matrices as a drop-in replacement for linear components in deep learning models. In Section 3.1, we use K-matrices to replace hand-crafted structure in two different settings. We simplify the six steps of filter bank computation in speech preprocessing into a single learnable K-matrix step, with only a 0.4% accuracy drop on the TIMIT speech recognition task. We use K-matrices to replace channel shuffles in ShuffleNet, improving ImageNet classification accuracy by up to 5%. In Section 3.2, we show that K-matrices can successfully recover latent structure; a K-matrix is used to learn latent permutations in a permuted image dataset (Permuted CIFAR), resulting in 9 points higher accuracy in a downstream CNN model. In Section 3.3, we show that our efficient K-matrix multiplication implementation can be applied to speed up real-world tasks: we replace linear layers with K-matrices in a DynamicConv Transformer network to attain 36% faster end-to-end inference speed with only a 1.0-point drop in BLEU score on the IWSLT-14 German-English translation task. (An illustrative K-matrix layer sketch appears after this table.)
Researcher Affiliation | Academia | Tri Dao (1), Nimit Sharad Sohoni (2), Albert Gu (1), Matthew Eichhorn (3), Amit Blonder (4), Megan Leszczynski (1), Atri Rudra (4), Christopher Ré (1). (1) Department of Computer Science, Stanford University; (2) Institute for Computational and Mathematical Engineering, Stanford University; (3) Center for Applied Mathematics, Cornell University; (4) Department of Computer Science and Engineering, University at Buffalo, The State University of New York
Pseudocode | No | The paper describes mathematical definitions and theoretical constructions but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code that implements Kaleidoscope matrix multiplication is available at https://github.com/HazyResearch/learning-circuits
Open Datasets | Yes | We evaluate our speech recognition models on the TIMIT speech corpus (Garofolo et al., 1993), a standard benchmark for speech recognition. We evaluate the CNN architectures on the image classification task of the standard ImageNet dataset (Russakovsky et al., 2015). For the latent-structure recovery task, we use a permuted image classification dataset (Permuted CIFAR-10).
Dataset Splits | Yes | Table 2: Top-1 classification accuracy of ShuffleNet on the ImageNet validation set... We use the standard data augmentation, training, and evaluation pipeline as in (Xie et al., 2017). The input is audio (16-bit, 16 kHz .wav format), and the target is the transcription into a sequence of phonemes (units of spoken sound)... (i) the waveform is framed (split into chunks of 25 ms each that overlap by 10 ms)... Table 3: Permuted CIFAR-10 validation set classification accuracy (%). (A framing sketch appears after this table.)
Hardware Specification | Yes | We train with SGD on 8 GPUs for 90 epochs... We run the decoding script on the IWSLT-14 De-En test set in single-threaded mode on a server with an Intel Xeon E5-2690 v4 CPU at 2.60 GHz, and measure wall-clock time.
Software Dependencies | Yes | We use PyTorch (Paszke et al., 2017), the Kaldi speech recognition toolkit (Povey et al., 2011), and the PyTorch-Kaldi toolkit (Ravanelli et al., 2019) for developing PyTorch speech recognition models for all our experiments and evaluations. We use the implementation from the Fairseq library (Ott et al., 2019), with PyTorch version 1.2.
Experiment Setup | Yes | We train with SGD on 8 GPUs for 90 epochs, with a total batch size of 2048 and initial learning rate 0.8. For the 1.0 ShuffleNet g8 architecture, we reduce the total batch size to 1792 to fit into GPU memory, and correspondingly linearly scale the initial learning rate to 0.7. Other hyperparameters (e.g., learning rate schedule, weight decay) are kept the same as in the ShuffleNet paper (Zhang et al., 2018). All models are trained for 200 total epochs with the Adam optimizer, using the standard learning rate schedule and weight decay from Mostafa & Wang (2019). We grid search the initial learning rate for the preprocessing layer (if applicable) in {5e-5, 1e-4, 2e-4, 4e-4, 8e-4, 1.6e-3}, and fix all other hyperparameters (including the initial learning rates for the other parts of the network) to their default values in the PyTorch-Kaldi repository. The model and any preprocessing layers are trained end-to-end with the RMSProp optimizer for 24 epochs (as per the defaults in PyTorch-Kaldi). (A training-configuration sketch appears after this table.)
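
To make the "drop-in replacement for linear components" claim quoted under Research Type concrete, below is a minimal PyTorch sketch of a single butterfly-factorized linear map, the building block from which K-matrices are composed (a K-matrix is a product of such butterfly matrices and their transposes). This is not the authors' implementation, which lives in the learning-circuits repository linked above; the class name ButterflyLinear, the initialization scale, and the increasing-stride factor ordering are choices made for this illustration.

    import torch
    import torch.nn as nn

    class ButterflyLinear(nn.Module):
        """Illustrative n-by-n butterfly linear map for n a power of 2.

        Uses O(n log n) parameters and multiply-adds instead of the O(n^2)
        of a dense nn.Linear. Sketch only; not the paper's implementation.
        """
        def __init__(self, n: int):
            super().__init__()
            assert n > 0 and (n & (n - 1)) == 0, "n must be a power of 2"
            self.n = n
            self.num_levels = n.bit_length() - 1  # log2(n)
            # One learnable 2x2 block per (level, coordinate pair).
            self.twiddle = nn.Parameter(
                torch.randn(self.num_levels, n // 2, 2, 2) / (2 ** 0.5))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n)
            batch = x.shape[0]
            for level in range(self.num_levels):
                stride = 1 << level
                groups = self.n // (2 * stride)
                # Pair coordinates (i, i + stride) inside blocks of size 2 * stride.
                y = x.reshape(batch, groups, 2, stride)
                t = self.twiddle[level].reshape(groups, stride, 2, 2)
                # Mix each pair with its own learnable 2x2 matrix.
                y = torch.einsum('bgps,gsqp->bgqs', y, t)
                x = y.reshape(batch, self.n)
            return x

Used as, for example, layer = ButterflyLinear(64); out = layer(torch.randn(32, 64)), it exposes the same square-linear interface as nn.Linear(64, 64) while keeping the O(n log n) cost that underlies the inference-speed results quoted above.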
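The framing step quoted under Dataset Splits (16 kHz audio split into 25 ms frames that overlap by 10 ms) can be made concrete with a short NumPy sketch. The function name and the handling of trailing samples are choices for this illustration, not code from the paper or the PyTorch-Kaldi pipeline.

    import numpy as np

    def frame_waveform(signal: np.ndarray, sample_rate: int = 16_000,
                       frame_ms: float = 25.0, overlap_ms: float = 10.0) -> np.ndarray:
        """Split a 1-D waveform into overlapping frames (illustrative sketch)."""
        frame_len = int(round(sample_rate * frame_ms / 1000))          # 400 samples at 16 kHz
        hop = frame_len - int(round(sample_rate * overlap_ms / 1000))  # 240-sample step
        num_frames = max(0, 1 + (len(signal) - frame_len) // hop)
        idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
        return signal[idx]                                             # (num_frames, frame_len)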
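As a concrete reading of the ImageNet portion of the Experiment Setup row, the sketch below builds an SGD optimizer with the quoted linear learning-rate scaling (batch 2048 at learning rate 0.8 scales to 0.7 at batch 1792). The momentum, weight decay, and schedule shown are placeholders, since the quote defers those hyperparameters to the ShuffleNet paper; treat them as assumptions, not the authors' exact values.

    import torch

    def make_shufflenet_sgd(model: torch.nn.Module, total_batch_size: int = 2048,
                            base_lr: float = 0.8, base_batch: int = 2048,
                            momentum: float = 0.9, weight_decay: float = 4e-5):
        """SGD + schedule sketch for the quoted 90-epoch ImageNet setup (assumed values)."""
        # Linear LR scaling: 2048 -> 0.8, 1792 -> 0.7, matching the quoted adjustment.
        lr = base_lr * total_batch_size / base_batch
        optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                    momentum=momentum, weight_decay=weight_decay)
        # Placeholder decay over 90 epochs; the actual schedule follows Zhang et al. (2018).
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
        return optimizer, scheduler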