NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition

Authors: Abhinav Mehrotra, Alberto Gil C. P. Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander Vipperla, Thomas Chau, Mohamed S Abdelfattah, Samin Ishtiaq, Nicholas Donald Lane

ICLR 2021

Reproducibility Variable Result LLM Response
Research Type Experimental The dataset consists of 8,242 unique models trained on the TIMIT audio dataset for three different target epochs, each starting from three different initializations. The dataset also includes runtime measurements of all the models on a diverse set of hardware platforms.
Researcher Affiliation Collaboration ¹Samsung AI Center, Cambridge; ²University of Cambridge. Equal contribution. {a.mehrotra1,a.gilramos,sourav.b1,l.dudziak}@samsung.com
Pseudocode No The paper describes methods and uses algorithms but does not include structured pseudocode blocks or sections explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code Yes The NAS-Bench-ASR dataset and the code can be downloaded from https://github.com/AbhinavMehrotra/nb-asr.
Open Datasets Yes To build the dataset, we have trained 8,242 unique convolutional neural network architectures on the TIMIT dataset [Garofolo et al., 1993].
Dataset Splits Yes Following Lee & Hon [1989], we split the core test dataset into a test partition, consisting of 24 speakers, and a validation partition.
Hardware Specification Yes We leveraged NVIDIA V100 and P40 GPUs, and decreased training time by increasing throughput via the bucketing strategy based on the audio length. [...] Additionally, we computed the number of parameters and floating point operations (FLOPs) for each of the architectures and measured their latency on two commonly used hardware platforms: Tesla 1080Ti and Jetson Nano.
Software Dependencies No Individual models are trained using a TensorFlow-based training pipeline running on a single GPU.
Experiment Setup Yes The best macro structure parameters are presented above (see 3.1), whereas the best LR was 10^-4, and the decay factor and start epoch were: (i) 0.9 and 5 for target epoch 40, (ii) 0.631 and 2 for target epoch 10, and (iii) 0.398 and 1 for target epoch 5, respectively. [...] For efficiency, we also use a batch bucketing strategy, where a batch size of 64 is used for audio utterances smaller than 2s, and a batch size of 32 is used otherwise. We used a CTC beam-search decoder with a beam size of 12. [...] Our dataset contains logs of each of the 8,242 models trained with three different seeds and for three target epochs (5, 10 and 40), thus generating a total of 74,178 model training traces.
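The setup quoted above can be sketched in a few lines of Python. The function names and the exact decay form (exponential, applied per epoch once the start epoch is reached) are assumptions on my part; the constants (base LR 10^-4, decay 0.9 from epoch 5 for the target-epoch-40 setting, the 2 s bucketing threshold, and the model/seed/epoch counts) are quoted from the text.

```python
# Sketch of the training setup described above. Function names and the
# exact decay form are assumptions; all constants are quoted from the paper.

def bucket_batch_size(utterance_seconds: float) -> int:
    """Batch bucketing: 64 for utterances shorter than 2 s, else 32."""
    return 64 if utterance_seconds < 2.0 else 32

def learning_rate(epoch: int, base_lr: float = 1e-4,
                  decay: float = 0.9, start_epoch: int = 5) -> float:
    """LR schedule for the target-epoch-40 setting (decay 0.9 from epoch 5)."""
    if epoch < start_epoch:
        return base_lr
    return base_lr * decay ** (epoch - start_epoch + 1)

# Consistency check of the dataset size:
# 8,242 architectures x 3 seeds x 3 target epochs = 74,178 training traces.
total_traces = 8242 * 3 * 3
assert total_traces == 74178
```

The check confirms that the per-row numbers in the table (8,242 models, 3 seeds, 3 target epochs) are consistent with the 74,178 training traces the paper reports.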