Learning de-identified representations of prosody from raw audio
Authors: Jack Weston, Raphaël Lenain, Udeepa Meepegama, Emil Fristed
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. ... Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. ... In Figure 2, we investigate the tradeoff between performance and de-identifiability. ... Table 2. The results of our work (VQP), baseline representations and ablations on the DAMMP benchmark and the de-identification ratio (DIR). |
| Researcher Affiliation | Industry | Jack Weston¹, Raphaël Lenain¹, Udeepa Meepegama¹, Emil Fristed¹. ¹Novoic, London, UK. Correspondence to: Jack Weston <jack@novoic.com>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states that models are pretrained using 'a proprietary framework built on top of PyTorch' and references SpeechBrain as a related project, but it neither links to nor explicitly announces an open-source release of the authors' own code (VQP). |
| Open Datasets | Yes | Models using self-supervised pretraining consistently demonstrate the importance of large datasets (Devlin et al., 2018; Raffel et al., 2019; Brown et al., 2020). AudioSet (Gemmeke et al., 2017), for instance, is a large dataset for general-purpose audio machine learning... To standardize assessment of representations for spoken language understanding, we introduce a new benchmark, DAMMP. The dataset has parallel audio and text modalities of natural speech... DAMMP is composed of five datasets (Table 1) all with binary classification tasks where prosody is important: DAIC-WOZ (Low et al., 2020), ADReSS (de la Fuente Garcia et al., 2020; Pompili et al., 2020), MUStARD (Bryant, 2010; Woodland & Voyer, 2011), CMU-MOSEI (Liu et al., 2018; Jain et al., 2018), and POM (Okada et al., 2016; Siddiquie et al., 2015). |
| Dataset Splits | Yes | We partition the train set, $D = \{(x_i, y_i)\}_{i=1}^{n}$, into timesteps, $1 = t_0 < t_1 < \dots < t_S = n$, and train our probe, $p_\theta(y \mid x)$, such that at timestep $t_i$ the train set is $\{(x_j, y_j)\}_{j=1}^{t_i}$ and we evaluate on the set $\{(x_j, y_j)\}_{j=t_i+1}^{t_{i+1}}$, calculating the codelength as per Voita & Titov (2020). ... DAIC-WOZ, ADReSS, and CMU-MOSEI already had a canonical test set disjoint by speaker, whereas for MUStARD and POM we sampled the datasets to make balanced test sets for the binary variable of interest. For MUStARD, the train/test sets are disjoint by TV show as well as speaker, to make the task harder. (A sketch of this online codelength computation follows the table.) |
| Hardware Specification | Yes | We use a batch size of 128 samples and train on a single V100 GPU for 2.3 days. |
| Software Dependencies | No | The paper states, 'We pretrain our models using a proprietary framework built on top of PyTorch (Paszke et al., 2019),' but does not provide a specific version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | We uniformly mask 30% of all prosody tokens. The TCN comprises 9 layers, each with 30 filters, a stride of 1 and a kernel size of 2. We use exponentially increasing dilations of size 1, 2, 4, 8, 16, 32, 64, 128, 256 to yield a receptive field size of 512 frames. The 1×1 convolution similarly has 30 filters. The dropout probability is 10%. The product quantizer comprises 3 vector quantizers each of dimension 10 with an independent codebook of size 32... We choose a decay of γ = 0.99 for all quantizers and weight the commitment loss by α = 0.5. The linear layers have dimensionality 30. The Transformer encoder has 12 layers, 12 attention heads, inner (FFN) dimension 3,072, embedding size 768, ReLU activation and a 10% dropout probability. ... We postulate that prosody temporal interactions are relatively short compared to language and restrict the sequence length to 32 words. During pretraining, we also require a minimum sequence length of 16 words. We train using K = 9 distractors. We linearly warm up the learning rate from 0 to a maximum of $1.5 \times 10^{-5}$ at 10k steps before linearly decaying it to 0 at the final step. The model trains for 250k steps using the AdamW optimizer (Loshchilov & Hutter, 2017). We use a batch size of 128 samples and train on a single V100 GPU for 2.3 days. (Sketches of the contrastive objective, the dilated convolution stack and the learning-rate schedule follow the table.) |
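
The "Research Type" and "Experiment Setup" rows quote a contrastive self-supervised objective trained with K = 9 distractors, but the paper's exact loss is not reproduced in this report. Below is a minimal InfoNCE-style sketch of such an objective; the cosine-similarity scoring, the temperature value and the tensor layout are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style objective: pick the true target out of K distractors.

    context:     (batch, dim)     model output at a masked position
    positive:    (batch, dim)     the true (quantized) target for that position
    distractors: (batch, K, dim)  K negatives sampled from other positions
    The temperature and cosine scoring are assumptions, not quoted values.
    """
    # Put the positive at index 0, followed by the K distractors: (batch, K+1, dim)
    candidates = torch.cat([positive.unsqueeze(1), distractors], dim=1)

    # Cosine similarity between the context vector and every candidate: (batch, K+1)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)

    # The correct candidate is always at index 0
    targets = torch.zeros(context.size(0), dtype=torch.long, device=context.device)
    return F.cross_entropy(sims / temperature, targets)
```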
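The "Dataset Splits" row quotes the online (prequential) codelength of Voita & Titov (2020): a probe is retrained on a growing prefix of the data, scored on the next block, and the summed negative log-probabilities give the codelength. A minimal sketch is below, assuming numpy arrays, integer labels 0..n_classes-1 that all appear in the first block, and a scikit-learn logistic-regression probe (the paper's probe is not quoted here).

```python
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, timesteps, n_classes=2):
    """Prequential (online) codelength in the spirit of Voita & Titov (2020).

    X, y:      full (ordered) train set as numpy arrays
    timesteps: increasing sample counts t_1 < t_2 < ... < t_S = len(X)
    The first block is encoded with a uniform code; every later block is
    encoded by a probe trained on all preceding examples. The logistic-
    regression probe is an assumption, not the paper's probe.
    """
    codelength = timesteps[0] * math.log2(n_classes)  # uniform code for the first block
    for t_prev, t_next in zip(timesteps[:-1], timesteps[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:t_prev], y[:t_prev])
        probs = probe.predict_proba(X[t_prev:t_next])
        # -log2 probability assigned to the true labels of the held-out block
        codelength += -np.log2(probs[np.arange(t_next - t_prev), y[t_prev:t_next]]).sum()
    return codelength
```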
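The "Experiment Setup" row quotes the TCN hyperparameters: 9 layers of 30 filters with kernel size 2, stride 1, exponentially increasing dilations 1 through 256 (receptive field 1 + Σ(k−1)·d = 512 frames), a 1×1 convolution with 30 filters, and 10% dropout. The sketch below wires those quoted values into a plain PyTorch stack; the causal padding, ReLU activations and absence of residual connections are assumptions, since the exact block structure is not quoted.

```python
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    """Dilated 1-D convolution stack using the hyperparameters quoted above.

    Kernel size 2 with dilations 1, 2, ..., 256 gives a receptive field of
    1 + sum((kernel_size - 1) * d) = 1 + 511 = 512 frames.
    Causal left-padding, ReLU and the lack of residual connections are
    assumptions; the paper's exact block wiring is not quoted here.
    """

    def __init__(self, channels=30, kernel_size=2, n_layers=9, dropout=0.1):
        super().__init__()
        layers = []
        for i in range(n_layers):
            dilation = 2 ** i  # 1, 2, 4, ..., 256
            layers += [
                nn.ConstantPad1d(((kernel_size - 1) * dilation, 0), 0.0),  # causal padding
                nn.Conv1d(channels, channels, kernel_size, dilation=dilation),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
        layers.append(nn.Conv1d(channels, channels, kernel_size=1))  # the quoted 1x1 convolution
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, channels, frames); output keeps the same shape
        return self.net(x)
```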
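The quoted optimisation recipe (AdamW, linear warm-up from 0 to $1.5 \times 10^{-5}$ at 10k steps, linear decay to 0 at the final step of 250k) maps directly onto a LambdaLR schedule. The sketch below uses only those quoted values; the weight decay is left at PyTorch's default, which is an assumption.

```python
import torch

def build_optimizer_and_scheduler(model, max_lr=1.5e-5,
                                  warmup_steps=10_000, total_steps=250_000):
    """AdamW with linear warm-up to max_lr, then linear decay to 0.

    Only the peak learning rate, warm-up point, total steps and optimizer
    are quoted; the weight decay is PyTorch's default (an assumption).
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps  # linear warm-up from 0 to max_lr
        # linear decay from max_lr at 10k steps to 0 at the final step
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler  # call scheduler.step() once per training step
```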