Learning de-identified representations of prosody from raw audio
Authors: Jack Weston, Raphaël Lenain, Udeepa Meepegama, Emil Fristed
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal. ... Despite aggressive downsampling of the input and having no access to linguistic information, our model performs comparably to state-of-the-art speech representations on DAMMP, a new benchmark we introduce for spoken language understanding. ... In Figure 2, we investigate the tradeoff between performance and de-identifiability. ... Table 2. The results of our work (VQP), baseline representations and ablations on the DAMMP benchmark and the de-identification ratio (DIR). |
| Researcher Affiliation | Industry | Jack Weston¹, Raphaël Lenain¹, Udeepa Meepegama¹, Emil Fristed¹. ¹Novoic, London, UK. Correspondence to: Jack Weston <jack@novoic.com>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper states that models are pretrained using 'a proprietary framework built on top of PyTorch' and references SpeechBrain as a related project, but it neither links to nor explicitly announces an open-source release of the authors' own code (VQP). |
| Open Datasets | Yes | Models using self-supervised pretraining consistently demonstrate the importance of large datasets (Devlin et al., 2018; Raffel et al., 2019; Brown et al., 2020). AudioSet (Gemmeke et al., 2017), for instance, is a large dataset for general-purpose audio machine learning... To standardize assessment of representations for spoken language understanding, we introduce a new benchmark, DAMMP. The dataset has parallel audio and text modalities of natural speech... DAMMP is composed of five datasets (Table 1) all with binary classification tasks where prosody is important: DAIC-WOZ (Low et al., 2020), ADReSS (de la Fuente Garcia et al., 2020; Pompili et al., 2020), MUStARD (Bryant, 2010; Woodland & Voyer, 2011), CMU-MOSEI (Liu et al., 2018; Jain et al., 2018), and POM (Okada et al., 2016; Siddiquie et al., 2015). |
| Dataset Splits | Yes | We partition the train set, $D = \{(x_i, y_i)\}_{i=1}^{n}$, into timesteps, $1 = t_0 < t_1 < \dots < t_S = n$, and train our probe, $p_\theta(y \mid x)$, such that at timestep $t_i$ the train set is $\{(x_j, y_j)\}_{j=1}^{t_i}$ and we evaluate on the set $\{(x_j, y_j)\}_{j=t_i+1}^{t_{i+1}}$, calculating the codelength as per Voita & Titov (2020). ... DAIC-WOZ, ADReSS, and CMU-MOSEI already had a canonical test set disjoint by speaker, whereas for MUStARD and POM we sampled the datasets to make balanced test sets for the binary variable of interest. For MUStARD, the train/test sets are disjoint by TV show as well as speaker, to make the task harder. (A sketch of this online codelength computation follows the table.) |
| Hardware Specification | Yes | We use a batch size of 128 samples and train on a single V100 GPU for 2.3 days. |
| Software Dependencies | No | The paper states, 'We pretrain our models using a proprietary framework built on top of PyTorch (Paszke et al., 2019),' but does not provide a specific version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | We uniformly mask 30% of all prosody tokens. The TCN comprises 9 layers, each with 30 filters, a stride of 1 and a kernel size of 2. We use exponentially increasing dilations of size 1, 2, 4, 8, 16, 32, 64, 128, 256 to yield a receptive field size of 512 frames. The 1×1 convolution similarly has 30 filters. The dropout probability is 10%. The product quantizer comprises 3 vector quantizers each of dimension 10 with an independent codebook of size 32... We choose a decay of γ = 0.99 for all quantizers and weight the commitment loss by α = 0.5. The linear layers have dimensionality 30. The Transformer encoder has 12 layers, 12 attention heads, inner (FFN) dimension 3,072, embedding size 768, ReLU activation and a 10% dropout probability. ... We postulate that prosody temporal interactions are relatively short compared to language and restrict the sequence length to 32 words. During pretraining, we also require a minimum sequence length of 16 words. We train using K = 9 distractors. We linearly warm up the learning rate from 0 to a maximum of $1.5 \times 10^{-5}$ at 10k steps before linearly decaying it to 0 at the final step. The model trains for 250k steps using the AdamW optimizer (Loshchilov & Hutter, 2017). We use a batch size of 128 samples and train on a single V100 GPU for 2.3 days. (Sketches of the contrastive objective, the dilated convolution stack and the learning-rate schedule follow the table.) |
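
The "Research Type" and "Experiment Setup" rows quote a contrastive self-supervised objective trained with K = 9 distractors, but the paper's exact loss is not reproduced in this report. Below is a minimal InfoNCE-style sketch of such an objective; the cosine-similarity scoring, the temperature value and the tensor layout are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, positive, distractors, temperature=0.1):
    """InfoNCE-style objective: pick the true target out of K distractors.

    context:     (batch, dim)     model output at a masked position
    positive:    (batch, dim)     the true (quantized) target for that position
    distractors: (batch, K, dim)  K negatives sampled from other positions
    The temperature and cosine scoring are assumptions, not quoted values.
    """
    # Put the positive at index 0, followed by the K distractors: (batch, K+1, dim)
    candidates = torch.cat([positive.unsqueeze(1), distractors], dim=1)

    # Cosine similarity between the context vector and every candidate: (batch, K+1)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1)

    # The correct candidate is always at index 0
    targets = torch.zeros(context.size(0), dtype=torch.long, device=context.device)
    return F.cross_entropy(sims / temperature, targets)
```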
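The "Dataset Splits" row quotes the online (prequential) codelength of Voita & Titov (2020): a probe is retrained on a growing prefix of the data, scored on the next block, and the summed negative log-probabilities give the codelength. A minimal sketch is below, assuming numpy arrays, integer labels 0..n_classes-1 that all appear in the first block, and a scikit-learn logistic-regression probe (the paper's probe is not quoted here).

```python
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

def online_codelength(X, y, timesteps, n_classes=2):
    """Prequential (online) codelength in the spirit of Voita & Titov (2020).

    X, y:      full (ordered) train set as numpy arrays
    timesteps: increasing sample counts t_1 < t_2 < ... < t_S = len(X)
    The first block is encoded with a uniform code; every later block is
    encoded by a probe trained on all preceding examples. The logistic-
    regression probe is an assumption, not the paper's probe.
    """
    codelength = timesteps[0] * math.log2(n_classes)  # uniform code for the first block
    for t_prev, t_next in zip(timesteps[:-1], timesteps[1:]):
        probe = LogisticRegression(max_iter=1000).fit(X[:t_prev], y[:t_prev])
        probs = probe.predict_proba(X[t_prev:t_next])
        # -log2 probability assigned to the true labels of the held-out block
        codelength += -np.log2(probs[np.arange(t_next - t_prev), y[t_prev:t_next]]).sum()
    return codelength
```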
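The "Experiment Setup" row quotes the TCN hyperparameters: 9 layers of 30 filters with kernel size 2, stride 1, exponentially increasing dilations 1 through 256 (receptive field 1 + Σ(k−1)·d = 512 frames), a 1×1 convolution with 30 filters, and 10% dropout. The sketch below wires those quoted values into a plain PyTorch stack; the causal padding, ReLU activations and absence of residual connections are assumptions, since the exact block structure is not quoted.

```python
import torch
import torch.nn as nn

class DilatedConvStack(nn.Module):
    """Dilated 1-D convolution stack using the hyperparameters quoted above.

    Kernel size 2 with dilations 1, 2, ..., 256 gives a receptive field of
    1 + sum((kernel_size - 1) * d) = 1 + 511 = 512 frames.
    Causal left-padding, ReLU and the lack of residual connections are
    assumptions; the paper's exact block wiring is not quoted here.
    """

    def __init__(self, channels=30, kernel_size=2, n_layers=9, dropout=0.1):
        super().__init__()
        layers = []
        for i in range(n_layers):
            dilation = 2 ** i  # 1, 2, 4, ..., 256
            layers += [
                nn.ConstantPad1d(((kernel_size - 1) * dilation, 0), 0.0),  # causal padding
                nn.Conv1d(channels, channels, kernel_size, dilation=dilation),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
        layers.append(nn.Conv1d(channels, channels, kernel_size=1))  # the quoted 1x1 convolution
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, channels, frames); output keeps the same shape
        return self.net(x)
```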
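The quoted optimisation recipe (AdamW, linear warm-up from 0 to $1.5 \times 10^{-5}$ at 10k steps, linear decay to 0 at the final step of 250k) maps directly onto a LambdaLR schedule. The sketch below uses only those quoted values; the weight decay is left at PyTorch's default, which is an assumption.

```python
import torch

def build_optimizer_and_scheduler(model, max_lr=1.5e-5,
                                  warmup_steps=10_000, total_steps=250_000):
    """AdamW with linear warm-up to max_lr, then linear decay to 0.

    Only the peak learning rate, warm-up point, total steps and optimizer
    are quoted; the weight decay is PyTorch's default (an assumption).
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps  # linear warm-up from 0 to max_lr
        # linear decay from max_lr at 10k steps to 0 at the final step
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler  # call scheduler.step() once per training step
```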