Sequence Modeling via Segmentations
Authors: Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, Li Deng
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate our approach on text segmentation and speech recognition tasks. In addition to quantitative results, we also show that our approach can discover meaningful segments in their respective application contexts. ... Section 4 includes two case studies to demonstrate the usefulness of our approach through both quantitative and qualitative results. |
| Researcher Affiliation | Collaboration | 1Microsoft Research, 2Carnegie Mellon University, 3Amazon, 4Citadel Securities LLC. Correspondence to: Chong Wang <chowang@microsoft.com>. |
| Pseudocode | Yes | Algorithm 1 SWAN beam search decoding |
| Open Source Code | No | We plan to release this package in a deep learning framework. |
| Open Datasets | Yes | We use two datasets including AP (Associated Press, 2,246 documents) from Blei et al. (2003) and CiteULike scientific article abstracts (16,980 documents) from Wang & Blei (2011). ... We evaluate SWAN on the TIMIT corpus following the setup in Deng et al. (2006). |
| Dataset Splits | No | The paper mentions 'a development set for early stopping' for the LDA baseline, but does not provide explicit training/validation/test dataset splits (percentages or counts) for its own model or for all experiments. |
| Hardware Specification | No | The paper does not describe the specific hardware used, such as GPU or CPU models, for running its experiments. |
| Software Dependencies | No | The paper mentions software like 'torch' and 'Adam algorithm' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For our model, the inference network is a 2-layer feed-forward neural network with ReLU nonlinearity. A two-layer GRU is used to model the segments in the distribution p(y1:T |Wθ(ζ)). And we vary the hidden unit size (as well as the word embedding size) to be 100, 150 and 200, and the maximum segment length L to be 1, 2 and 3. We use Adam algorithm (Kingma & Ba, 2014) for optimization with batch size 32 and learning rate 0.001. ... Our SWAN model consists of a 5-layer bidirectional GRU with 300 hidden units as the encoder and two 2-layer unidirectional GRU(s) with 600 hidden units, one for the segments and the other for connecting the segments in SWAN. We set the maximum segment length L = 3. To reduce the temporal input size for SWAN, we add a temporal convolutional layer with stride 2 and width 2 at the end of the encoder. For optimization, we largely followed the strategy in Zhang et al. (2017). We use Adam (Kingma & Ba, 2014) with learning rate 4e-4. We then use stochastic gradient descent with learning rate 3e-5 for fine-tuning. Batch size 20 is used during training. We use dropout with probability of 0.3 across the layers except for the input and output layers. Beam size 40 is used for decoding. |
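The quoted speech-recognition setup pins down layer shapes and the optimizer schedule fairly precisely. Below is a minimal PyTorch sketch of just those quoted hyperparameters; it is not the authors' code, the SWAN segment-level loss and beam-search decoder are omitted, and the input feature dimension (`FEAT_DIM`) and all module/variable names are assumptions introduced here for illustration.

```python
# Sketch of the quoted SWAN speech-recognition configuration (not the authors' implementation).
import torch
import torch.nn as nn

FEAT_DIM = 123          # assumption: acoustic feature dimension (not stated in the table)
MAX_SEGMENT_LEN = 3     # "maximum segment length L = 3"

class Encoder(nn.Module):
    """5-layer bidirectional GRU encoder (300 hidden units per direction),
    followed by a temporal convolution with width 2 and stride 2 that
    roughly halves the input length before SWAN."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, 300, num_layers=5,
                          bidirectional=True, dropout=0.3, batch_first=True)
        self.conv = nn.Conv1d(600, 600, kernel_size=2, stride=2)

    def forward(self, x):                 # x: (batch, time, FEAT_DIM)
        h, _ = self.rnn(x)                # (batch, time, 600)
        h = self.conv(h.transpose(1, 2))  # conv over time: (batch, 600, time // 2)
        return h.transpose(1, 2)          # (batch, time // 2, 600)

# Two 2-layer unidirectional GRUs with 600 hidden units: one models the
# outputs within a segment, the other connects consecutive segments.
segment_rnn = nn.GRU(600, 600, num_layers=2, dropout=0.3, batch_first=True)
connector_rnn = nn.GRU(600, 600, num_layers=2, dropout=0.3, batch_first=True)

encoder = Encoder()
params = (list(encoder.parameters())
          + list(segment_rnn.parameters())
          + list(connector_rnn.parameters()))

# Optimization schedule quoted above: Adam at 4e-4, then SGD at 3e-5 for
# fine-tuning; batch size 20 during training, beam size 40 at decoding time.
optimizer = torch.optim.Adam(params, lr=4e-4)
finetune_optimizer = torch.optim.SGD(params, lr=3e-5)
BATCH_SIZE, BEAM_SIZE = 20, 40
```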