STCN: Stochastic Temporal Convolutional Networks

Authors: Emre Aksan, Otmar Hilliges

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed variants STCN and STCN-dense both quantitatively and qualitatively on modeling of digital handwritten text and speech. We compare with vanilla TCNs, RNNs, VRNNs and state-of-the-art models on the corresponding tasks. Table 1: Average log-likelihood per sequence on TIMIT, Blizzard, IAM-OnDB and Deepwriting datasets.
Researcher Affiliation | Academia | Emre Aksan & Otmar Hilliges, Department of Computer Science, ETH Zurich, Switzerland. {emre.aksan, otmar.hilliges}@inf.ethz.ch
Pseudocode | No | The paper includes a diagram (Figure 4) of the model architecture but no textual pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://ait.ethz.ch/projects/2019/stcn/.
Open Datasets | Yes | The IAM-OnDB data is split and pre-processed as done in (Chung et al., 2015). Aksan et al. (2018) extend this dataset with additional samples and better pre-processing. TIMIT and Blizzard are standard benchmark datasets in speech modeling.
Dataset Splits | Yes | The IAM-OnDB data is split and pre-processed as done in (Chung et al., 2015). We applied early stopping by measuring the ELBO performance on the validation splits.
Hardware Specification | Yes | We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Software Dependencies | No | We implement STCN models in Tensorflow (Abadi et al., 2016).
Experiment Setup | Yes | In all STCN experiments we applied KL annealing. In all tasks, the weight of the KL term is initialized with 0 and increased by 1e-4 at every step until it reaches 1. The batch size was 20 for all datasets except for Blizzard where it was 128. We use the ADAM optimizer with its default parameters and exponentially decay the learning rate. For the handwriting datasets the learning rate was initialized with 5e-4 and followed a decay rate of 0.94 over 1000 decay steps. On the speech datasets it was initialized with 1e-3 and decayed with a rate of 0.98.
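
The training-schedule details quoted above (KL weight growing by 1e-4 per step, exponential learning-rate decay, ADAM with default parameters) map directly onto a few lines of code. The sketch below is a minimal reconstruction under the assumption of the TensorFlow 2 Keras API; the paper's implementation was written in TensorFlow 1.x, and the kl_weight helper name is an illustrative choice, not the authors' code.

# Minimal sketch of the reported training schedule, assuming the TensorFlow 2
# Keras API (the paper used TensorFlow 1.x; names here are illustrative).
import tensorflow as tf

# Exponential learning-rate decay, handwriting settings: initial rate 5e-4,
# decay rate 0.94 applied over 1000 decay steps (speech: 1e-3 and 0.98).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-4,
    decay_steps=1000,
    decay_rate=0.94,
)
# ADAM with its default parameters, driven by the decaying learning rate.
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)


def kl_weight(step: int) -> tf.Tensor:
    """KL annealing: weight starts at 0, grows by 1e-4 per step, capped at 1."""
    return tf.minimum(1.0, tf.cast(step, tf.float32) * 1e-4)


# In a training step the annealed weight scales the KL term of the ELBO, e.g.
#   loss = reconstruction_nll + kl_weight(step) * kl_divergence

With a batch size of 20 (128 for Blizzard), these schedules reproduce the optimization settings reported in the paper's experiment setup.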