Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

Authors: Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate that, due to their expressiveness, learnability, and efficiency, we can use K-matrices as a drop-in replacement for linear components in deep learning models. In Section 3.1, we use K-matrices to replace hand-crafted structure in two different settings. We simplify the six steps of filter bank computation in speech preprocessing into a single learnable K-matrix step, with only a 0.4% accuracy drop on the TIMIT speech recognition task. We use K-matrices to replace channel shuffles in ShuffleNet, improving ImageNet classification accuracy by up to 5%. In Section 3.2, we show that K-matrices can successfully recover latent structure; a K-matrix is used to learn latent permutations in a permuted image dataset (Permuted CIFAR), resulting in 9 points higher accuracy in a downstream CNN model. In Section 3.3, we show that our efficient K-matrix multiplication implementation can be applied to speed up real-world tasks: we replace linear layers with K-matrices in a DynamicConv Transformer network to attain 36% faster end-to-end inference speed with only a 1.0-point drop in BLEU score on the IWSLT-14 German-English translation task. (An illustrative K-matrix layer sketch appears after this table.)
Researcher Affiliation | Academia | Tri Dao (1), Nimit Sharad Sohoni (2), Albert Gu (1), Matthew Eichhorn (3), Amit Blonder (4), Megan Leszczynski (1), Atri Rudra (4), Christopher Ré (1). (1) Department of Computer Science, Stanford University; (2) Institute for Computational and Mathematical Engineering, Stanford University; (3) Center for Applied Mathematics, Cornell University; (4) Department of Computer Science and Engineering, University at Buffalo, The State University of New York
Pseudocode | No | The paper describes mathematical definitions and theoretical constructions but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code that implements Kaleidoscope matrix multiplication is available at https://github.com/HazyResearch/learning-circuits
Open Datasets | Yes | We evaluate our speech recognition models on the TIMIT speech corpus (Garofolo et al., 1993), a standard benchmark for speech recognition. We evaluate the CNN architectures on the image classification task of the standard ImageNet dataset (Russakovsky et al., 2015). For the latent-structure recovery task, we use a permuted image classification dataset (Permuted CIFAR-10).
Dataset Splits | Yes | Table 2: Top-1 classification accuracy of ShuffleNet on the ImageNet validation set... We use the standard data augmentation, training, and evaluation pipeline as in (Xie et al., 2017). The input is audio (16-bit, 16 kHz .wav format), and the target is the transcription into a sequence of phonemes (units of spoken sound)... (i) the waveform is framed (split into chunks of 25 ms each that overlap by 10 ms)... Table 3: Permuted CIFAR-10 validation set classification accuracy (%). (A framing sketch appears after this table.)
Hardware Specification | Yes | We train with SGD on 8 GPUs for 90 epochs... We run the decoding script on the IWSLT-14 De-En test set in single-threaded mode on a server with an Intel Xeon E5-2690 v4 CPU at 2.60 GHz, and measure wall-clock time.
Software Dependencies | Yes | We use PyTorch (Paszke et al., 2017), the Kaldi speech recognition toolkit (Povey et al., 2011), and the PyTorch-Kaldi toolkit (Ravanelli et al., 2019) for developing PyTorch speech recognition models for all our experiments and evaluations. We use the implementation from the Fairseq library (Ott et al., 2019), with PyTorch version 1.2.
Experiment Setup | Yes | We train with SGD on 8 GPUs for 90 epochs, with a total batch size of 2048 and initial learning rate 0.8. For the 1.0 ShuffleNet g8 architecture, we reduce the total batch size to 1792 to fit into GPU memory, and correspondingly linearly scale the initial learning rate to 0.7. Other hyperparameters (e.g., learning rate schedule, weight decay) are kept the same as in the ShuffleNet paper (Zhang et al., 2018). All models are trained for 200 total epochs with the Adam optimizer, using the standard learning rate schedule and weight decay from Mostafa & Wang (2019). We grid search the initial learning rate for the preprocessing layer (if applicable) in {5e-5, 1e-4, 2e-4, 4e-4, 8e-4, 1.6e-3}, and fix all other hyperparameters (including the initial learning rates for the other parts of the network) to their default values in the PyTorch-Kaldi repository. The model and any preprocessing layers are trained end-to-end with the RMSProp optimizer for 24 epochs (as per the defaults in PyTorch-Kaldi). (A training-configuration sketch appears after this table.)
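
To make the "drop-in replacement for linear components" claim quoted under Research Type concrete, below is a minimal PyTorch sketch of a single butterfly-factorized linear map, the building block from which K-matrices are composed (a K-matrix is a product of such butterfly matrices and their transposes). This is not the authors' implementation, which lives in the learning-circuits repository linked above; the class name ButterflyLinear, the initialization scale, and the increasing-stride factor ordering are choices made for this illustration.

    import torch
    import torch.nn as nn

    class ButterflyLinear(nn.Module):
        """Illustrative n-by-n butterfly linear map for n a power of 2.

        Uses O(n log n) parameters and multiply-adds instead of the O(n^2)
        of a dense nn.Linear. Sketch only; not the paper's implementation.
        """
        def __init__(self, n: int):
            super().__init__()
            assert n > 0 and (n & (n - 1)) == 0, "n must be a power of 2"
            self.n = n
            self.num_levels = n.bit_length() - 1  # log2(n)
            # One learnable 2x2 block per (level, coordinate pair).
            self.twiddle = nn.Parameter(
                torch.randn(self.num_levels, n // 2, 2, 2) / (2 ** 0.5))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, n)
            batch = x.shape[0]
            for level in range(self.num_levels):
                stride = 1 << level
                groups = self.n // (2 * stride)
                # Pair coordinates (i, i + stride) inside blocks of size 2 * stride.
                y = x.reshape(batch, groups, 2, stride)
                t = self.twiddle[level].reshape(groups, stride, 2, 2)
                # Mix each pair with its own learnable 2x2 matrix.
                y = torch.einsum('bgps,gsqp->bgqs', y, t)
                x = y.reshape(batch, self.n)
            return x

Used as, for example, layer = ButterflyLinear(64); out = layer(torch.randn(32, 64)), it exposes the same square-linear interface as nn.Linear(64, 64) while keeping the O(n log n) cost that underlies the inference-speed results quoted above.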
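The framing step quoted under Dataset Splits (16 kHz audio split into 25 ms frames that overlap by 10 ms) can be made concrete with a short NumPy sketch. The function name and the handling of trailing samples are choices for this illustration, not code from the paper or the PyTorch-Kaldi pipeline.

    import numpy as np

    def frame_waveform(signal: np.ndarray, sample_rate: int = 16_000,
                       frame_ms: float = 25.0, overlap_ms: float = 10.0) -> np.ndarray:
        """Split a 1-D waveform into overlapping frames (illustrative sketch)."""
        frame_len = int(round(sample_rate * frame_ms / 1000))          # 400 samples at 16 kHz
        hop = frame_len - int(round(sample_rate * overlap_ms / 1000))  # 240-sample step
        num_frames = max(0, 1 + (len(signal) - frame_len) // hop)
        idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
        return signal[idx]                                             # (num_frames, frame_len)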
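As a concrete reading of the ImageNet portion of the Experiment Setup row, the sketch below builds an SGD optimizer with the quoted linear learning-rate scaling (batch 2048 at learning rate 0.8 scales to 0.7 at batch 1792). The momentum, weight decay, and schedule shown are placeholders, since the quote defers those hyperparameters to the ShuffleNet paper; treat them as assumptions, not the authors' exact values.

    import torch

    def make_shufflenet_sgd(model: torch.nn.Module, total_batch_size: int = 2048,
                            base_lr: float = 0.8, base_batch: int = 2048,
                            momentum: float = 0.9, weight_decay: float = 4e-5):
        """SGD + schedule sketch for the quoted 90-epoch ImageNet setup (assumed values)."""
        # Linear LR scaling: 2048 -> 0.8, 1792 -> 0.7, matching the quoted adjustment.
        lr = base_lr * total_batch_size / base_batch
        optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                    momentum=momentum, weight_decay=weight_decay)
        # Placeholder decay over 90 epochs; the actual schedule follows Zhang et al. (2018).
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
        return optimizer, scheduler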