Low-Rank Time-Frequency Synthesis

Authors: Cédric Févotte, Matthieu Kowalski

NeurIPS 2014

Reproducibility assessment: each item below gives the variable, the result, and the supporting LLM response.
Research Type: Experimental. We describe two expectation-maximization algorithms for estimation in the new model and report audio signal processing results with music decomposition and speech enhancement.
Researcher Affiliation: Academia. Cédric Févotte, Laboratoire Lagrange (CNRS, OCA & Université de Nice), Nice, France, cfevotte@unice.fr; Matthieu Kowalski, Laboratoire des Signaux et Systèmes (CNRS, Supélec & Université Paris-Sud), Gif-sur-Yvette, France, kowalski@lss.supelec.fr
Pseudocode: Yes. E-step: z^(i) = E{z | x; θ^(i)} = α^(i) + (β/λ^(i)) Φ*(x − Φα^(i)) (16). M-step: for all (f, n), α^(i+1)_fn = v^(i)_fn / (v^(i)_fn + β) · z^(i)_fn, where v^(i)_fn = [W^(i)H^(i)]_fn (17); (W^(i+1), H^(i+1)) = arg min_{W,H ≥ 0} Σ_fn D_IS(|α^(i+1)_fn|² | [WH]_fn) (18); λ^(i+1) = ‖x − Φα^(i+1)‖²_2 / T (19). (A NumPy sketch of one such EM iteration is given after this list.)
Open Source Code: No. The paper mentions "Sound examples are provided in the supplementary material." but does not state that the source code for the method is openly available or provide a link to it.
Open Datasets: Yes. The training data, with sampling rate 16 kHz, is extracted from the TIMIT database [12].
Dataset Splits: No. The paper mentions training and test data but does not explicitly describe a separate validation split.
Hardware Specification: No. The paper does not specify the hardware used to run the experiments.
Software Dependencies: No. The paper mentions the Large Time-Frequency Analysis Toolbox (LTFAT) [7] but does not provide version numbers for its software dependencies.
Experiment Setup: Yes. We use a 2048-sample (≈ 46 ms) Hann window for the tonal layer and a 128-sample (≈ 3 ms) Hann window for the transient layer, both with a 50% time overlap. The number of latent components in the two layers is set to K = 3. The two t-f bases are Gabor frames with Hann windows of length 512 samples (≈ 32 ms) for the tonal layer and 32 samples (≈ 2 ms) for the transient layer, both with 50% overlap. The hyperparameter λ is gradually decreased to a negligible value during the iterations (resulting in a negligible residual e), a form of warm-restart strategy [13]. W^train_tonal and W^train_transient are fixed pre-trained dictionaries of dimension K = 500, obtained from 30 min of training speech containing male and female speakers. The noise dictionaries W^noise_tonal and W^noise_transient are learnt from the noisy data, using K = 2. (A SciPy sketch of the two-resolution time-frequency analysis is given below.)
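
As a rough illustration of the two-resolution analysis described in the experiment setup, the following SciPy sketch computes a tonal and a transient layer with the speech-enhancement window lengths (512 and 32 samples at 16 kHz). It is not the authors' implementation: the paper builds Gabor tight frames with LTFAT, whereas plain STFTs are used here only as a stand-in, and the input signal is a placeholder.

    import numpy as np
    from scipy.signal import stft

    fs = 16000                           # TIMIT sampling rate (16 kHz)
    x = np.random.randn(5 * fs)          # placeholder signal (5 s of noise)

    # Tonal layer: 512-sample Hann window (~32 ms), 50% overlap
    _, _, alpha_tonal = stft(x, fs=fs, window='hann', nperseg=512, noverlap=256)

    # Transient layer: 32-sample Hann window (~2 ms), 50% overlap
    _, _, alpha_transient = stft(x, fs=fs, window='hann', nperseg=32, noverlap=16)

    print(alpha_tonal.shape, alpha_transient.shape)   # (F_tonal, N_tonal), (F_trans, N_trans)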
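
For the algorithmic side, here is a minimal NumPy sketch of one EM iteration of the kind summarized in Eqs. (16)-(19). It is likewise only a sketch under stated assumptions, not the authors' code: the synthesis operator Φ and its adjoint are assumed to be supplied as black-box functions phi and phi_adj (hypothetical names, e.g. wrapping a Gabor/STFT pair), and the NMF subproblem of Eq. (18) is approximated with a few standard multiplicative IS-NMF updates.

    import numpy as np

    def lrtfs_em_iteration(x, alpha, W, H, lam, beta, phi, phi_adj,
                           n_nmf_updates=10, eps=1e-12):
        """One EM iteration for the LRTFS model (illustrative sketch).

        x        : observed signal, length T
        alpha    : current synthesis coefficients, complex array of shape (F, N)
        W, H     : nonnegative NMF factors, shapes (F, K) and (K, N)
        lam      : current residual variance lambda
        beta     : auxiliary variance, beta <= lam / (largest eigenvalue of Phi Phi*)
        phi      : synthesis operator, (F, N) coefficients -> length-T signal
        phi_adj  : adjoint (analysis) operator, length-T signal -> (F, N)
        """
        v = W @ H                                            # prior variances [WH]_fn
        # E-step, Eq. (16): posterior mean of the auxiliary variable z
        z = alpha + (beta / lam) * phi_adj(x - phi(alpha))
        # M-step, Eq. (17): Wiener-like shrinkage of z
        alpha = (v / (v + beta)) * z
        # M-step, Eq. (18): IS-NMF of |alpha|^2 via multiplicative updates
        P = np.abs(alpha) ** 2 + eps
        for _ in range(n_nmf_updates):
            V = W @ H + eps
            W = W * (((P / V ** 2) @ H.T) / ((1.0 / V) @ H.T))
            V = W @ H + eps
            H = H * ((W.T @ (P / V ** 2)) / (W.T @ (1.0 / V)))
        # M-step, Eq. (19): update the residual variance
        lam = np.sum(np.abs(x - phi(alpha)) ** 2) / x.size
        return alpha, W, H, lam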