Neural Speed Reading via Skim-RNN

Authors: Minjoon Seo, Sewon Min, Ali Farhadi, Hannaneh Hajishirzi

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that Skim-RNN can achieve significantly reduced computational cost without losing accuracy compared to standard RNNs across five different natural language tasks.
Researcher Affiliation | Collaboration | Clova AI Research, NAVER; University of Washington; Seoul National University; Allen Institute for Artificial Intelligence; XNOR.AI
Pseudocode | No | The paper describes the Skim-RNN architecture and its inference and training processes using mathematical equations and textual explanations, but it does not include a formally labeled pseudocode or algorithm block (an illustrative sketch of the skim update appears after this table).
Open Source Code | No | The paper does not include any explicit statements about making its source code publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | Table 1 lists common datasets used (SST, Rotten Tomatoes, IMDb, AGNews, CBT-NE, CBT-CN, SQuAD) and specifies the 'Number of examples' for training, validation, and test sets for several of these, indicating the use of established public datasets and their splits.
Dataset Splits | Yes | Table 1 explicitly lists 'Number of examples' for training, validation, and test sets for datasets like SST, Rotten Tomatoes, IMDb, AGNews, CBT-NE, and CBT-CN. For SQuAD, it provides train and dev (validation) sizes. This clearly indicates specified dataset splits.
Hardware Specification | Yes | Comparing between NumPy with CPU and TensorFlow with GPU (Titan X), we observe that the former has 1.5 times lower latency (75 µs vs 110 µs per token) for LSTM of d = 100.
Software Dependencies | No | The paper mentions 'Python (NumPy)', 'TensorFlow', and 'PyTorch' in the context of benchmarking speed. However, it does not provide specific version numbers for these software components, which is necessary for reproducible software dependencies.
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2015) for optimization, with initial learning rate of 0.0001. For Skim-LSTM, τ = max(0.5, exp(−rn)) where r = 1e-4 and n is the global training step, following Jang et al. (2017). We experiment on different sizes of big LSTM (d ∈ {100, 200}) and small LSTM (d′ ∈ {5, 10, 20}) and the ratio between the model loss and the skim loss (γ ∈ {0.01, 0.02}) for Skim-LSTM. We use batch size of 32 for SST and Rotten Tomatoes, and 128 for others. For all models, we stop early when the validation accuracy does not increase for 3000 global steps.
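
Since the paper provides no labeled pseudocode (see the Pseudocode row), the following is a minimal PyTorch-style sketch of one Skim-RNN inference step as the paper's equations describe it: a two-way decision p_t = softmax(α(x_t, h_{t−1})) selects either the big LSTM, which updates the full d-dimensional state, or the small LSTM, which updates only the first d′ dimensions and copies the rest. The module names, the use of nn.LSTMCell, and the single linear decision layer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SkimRNNCell(nn.Module):
    """Illustrative sketch of one Skim-RNN step with a hard (inference-time) decision.

    d   : hidden size of the big LSTM
    d_s : hidden size of the small ("skim") LSTM, d_s << d
    """
    def __init__(self, input_size, d, d_s):
        super().__init__()
        self.d, self.d_s = d, d_s
        self.big = nn.LSTMCell(input_size, d)        # full update of the d-dim state
        self.small = nn.LSTMCell(input_size, d_s)    # cheap update of the first d_s dims
        self.decide = nn.Linear(input_size + d, 2)   # alpha(x_t, h_{t-1}) -> read/skim logits

    def forward(self, x, state):
        h, c = state                                  # each of shape (batch, d)
        p = torch.softmax(self.decide(torch.cat([x, h], dim=-1)), dim=-1)
        read = p.argmax(dim=-1, keepdim=True).float() # channel 1 = full read, 0 = skim (convention here)
        # Full read: the big LSTM rewrites the entire state.
        h_big, c_big = self.big(x, (h, c))
        # Skim: the small LSTM rewrites only the first d_s dimensions; the rest is copied.
        h_sm, c_sm = self.small(x, (h[:, :self.d_s], c[:, :self.d_s]))
        h_skim = torch.cat([h_sm, h[:, self.d_s:]], dim=-1)
        c_skim = torch.cat([c_sm, c[:, self.d_s:]], dim=-1)
        h_new = read * h_big + (1 - read) * h_skim
        c_new = read * c_big + (1 - read) * c_skim
        return h_new, c_new
```

At training time the hard argmax is replaced by a Gumbel-softmax sample so the read/skim decision stays differentiable; the temperature schedule quoted in the Experiment Setup row is sketched further below.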
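
The Hardware Specification row quotes a per-token latency comparison between NumPy on CPU and TensorFlow on a Titan X GPU for an LSTM with d = 100. A rough way to measure the CPU side of that comparison is sketched below; the fused-gate weight layout, warm-up count, and iteration count are arbitrary illustration choices, not the authors' benchmark code.

```python
import numpy as np, time

d, input_size = 100, 100                      # sizes matching the quoted benchmark
W = np.random.randn(4 * d, input_size + d).astype(np.float32)  # fused LSTM gate weights
b = np.zeros(4 * d, dtype=np.float32)
x = np.random.randn(input_size).astype(np.float32)
h = np.zeros(d, dtype=np.float32)
c = np.zeros(d, dtype=np.float32)

def lstm_step(x, h, c):
    # One LSTM step: gate pre-activations, sigmoid/tanh nonlinearities, state update.
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    i, f, o = 1 / (1 + np.exp(-i)), 1 / (1 + np.exp(-f)), 1 / (1 + np.exp(-o))
    c = f * c + i * np.tanh(g)
    return o * np.tanh(c), c

for _ in range(100):                          # warm up
    h, c = lstm_step(x, h, c)
n = 10000
t0 = time.perf_counter()
for _ in range(n):                            # time per-token latency on CPU
    h, c = lstm_step(x, h, c)
print(f"{(time.perf_counter() - t0) / n * 1e6:.1f} us per token")
```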
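
The Experiment Setup row quotes the Gumbel-softmax temperature annealing τ = max(0.5, exp(−rn)) with r = 1e-4 and n the global training step. A minimal sketch of that schedule, and of how it could plug into a differentiable read/skim decision, follows; using torch.nn.functional.gumbel_softmax in place of the paper's own reparameterization code is an assumption of this sketch.

```python
import math
import torch
import torch.nn.functional as F

def gumbel_temperature(step, r=1e-4, floor=0.5):
    """tau = max(0.5, exp(-r * n)), following the quoted setup (Jang et al., 2017)."""
    return max(floor, math.exp(-r * step))

# Illustrative schedule values: step 0 -> 1.0, step 5000 -> ~0.61, step ~6931 and beyond -> clamped at 0.5.

# During training, the hard argmax in the SkimRNNCell sketch would be replaced by a
# straight-through Gumbel-softmax sample so gradients flow through the decision.
decision_logits = torch.randn(32, 2)                             # dummy logits for a batch of 32
tau = gumbel_temperature(step=5000)
sample = F.gumbel_softmax(decision_logits, tau=tau, hard=True)   # one-hot forward, soft backward
```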