NAS-Bench-ASR: Reproducible Neural Architecture Search for Speech Recognition
Authors: Abhinav Mehrotra, Alberto Gil C. P. Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander Vipperla, Thomas Chau, Mohamed S Abdelfattah, Samin Ishtiaq, Nicholas Donald Lane
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The dataset consists of 8,242 unique models trained on the TIMIT audio dataset for three different target epochs, each starting from three different initializations. The dataset also includes runtime measurements of all the models on a diverse set of hardware platforms. |
| Researcher Affiliation | Collaboration | 1: Samsung AI Center, Cambridge; 2: University of Cambridge. Equal contribution. {a.mehrotra1,a.gilramos,sourav.b1,l.dudziak}@samsung.com |
| Pseudocode | No | The paper describes methods and uses algorithms but does not include structured pseudocode blocks or sections explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The NAS-Bench-ASR dataset and the code can be downloaded from https://github.com/AbhinavMehrotra/nb-asr. |
| Open Datasets | Yes | To build the dataset, we have trained 8,242 unique convolutional neural network architectures on the TIMIT dataset [Garofolo et al., 1993]. |
| Dataset Splits | Yes | Following Lee & Hon [1989], we split the core test dataset into a test partition, consisting of 24 speakers, and a validation partition. |
| Hardware Specification | Yes | We leveraged NVIDIA V100 and P40 GPUs, and decreased training time by increasing throughput via the bucketing strategy based on the audio length. [...] Additionally, we computed the number of parameters and floating point operations (FLOPs) for each of the architectures and measured their latency on two commonly used hardware platforms: Tesla 1080Ti and Jetson Nano. |
| Software Dependencies | No | Individual models are trained using a TensorFlow-based training pipeline running on a single GPU. |
| Experiment Setup | Yes | The best macro structure parameters are presented above (see 3.1), whereas the best LR was 10⁻⁴, and the decay factor and start epoch were: (i) 0.9 and 5 for target epoch 40, (ii) 0.631 and 2 for target epoch 10, and (iii) 0.398 and 1 for target epoch 5, respectively. [...] For efficiency, we also use a batch bucketing strategy, where a batch size of 64 is used for audio utterances smaller than 2s, and a batch size of 32 is used otherwise. We used a CTC beam-search decoder with a beam size of 12. [...] Our dataset contains logs of each of the 8,242 models trained with three different seeds and for three target epochs (5, 10 and 40), thus generating a total of 74,178 model training traces. (The bucketing rule and the LR schedule are sketched below the table.) |
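
The batch bucketing rule quoted in the Experiment Setup row is simple enough to illustrate directly. The sketch below is not the authors' TensorFlow pipeline; the function name and data layout (`bucket_batches` over (utterance_id, duration) pairs) are assumptions made for illustration. It also includes a consistency check that 8,242 architectures × 3 seeds × 3 target epochs matches the 74,178 training traces quoted above.

```python
# Illustrative sketch (not the authors' pipeline): bucket utterances by length
# so that short clips (< 2 s) are batched 64 at a time and longer ones 32 at a
# time, matching the Experiment Setup row above.
from typing import Iterable, List, Tuple

def bucket_batches(utterances: Iterable[Tuple[str, float]],
                   threshold_s: float = 2.0,
                   short_bs: int = 64,
                   long_bs: int = 32) -> List[List[str]]:
    """Group (utterance_id, duration_in_seconds) pairs into batches."""
    short_pool: List[str] = []
    long_pool: List[str] = []
    for utt_id, duration in utterances:
        (short_pool if duration < threshold_s else long_pool).append(utt_id)

    batches: List[List[str]] = []
    for pool, bs in ((short_pool, short_bs), (long_pool, long_bs)):
        for i in range(0, len(pool), bs):
            batches.append(pool[i:i + bs])
    return batches

# Consistency check on the dataset size quoted above:
# 8,242 architectures x 3 seeds x 3 target epochs = 74,178 training traces.
assert 8242 * 3 * 3 == 74178
```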
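Similarly, here is a minimal sketch of the reported learning-rate schedule, assuming the decay factor is applied once per epoch from the quoted start epoch onward; the quote does not spell out the exact functional form, so this exponential reading is an assumption made for illustration.

```python
# Minimal sketch of the LR schedule quoted above (assumed exponential decay
# from `start_epoch` onward; base LR of 1e-4 for all three regimes).
def learning_rate(epoch: int, base_lr: float, decay: float, start_epoch: int) -> float:
    """Return the LR for a given epoch under per-epoch exponential decay."""
    if epoch < start_epoch:
        return base_lr
    return base_lr * decay ** (epoch - start_epoch)

# (decay factor, start epoch) quoted for each target-epoch regime.
schedules = {40: (0.9, 5), 10: (0.631, 2), 5: (0.398, 1)}
for target, (decay, start) in schedules.items():
    final_lr = learning_rate(target - 1, 1e-4, decay, start)
    print(f"target {target:>2} epochs: final LR ~= {final_lr:.2e}")
```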