Towards Voice Reconstruction from EEG during Imagined Speech
Authors: Young-Eun Lee, Seo-Hyun Lee, Sang-Ho Kim, Seong-Whan Lee
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose NeuroTalk, which converts non-invasive brain signals of imagined speech into the user's own voice. Our model was trained with spoken speech EEG, which was generalized to adapt to the domain of imagined speech, thus allowing natural correspondence between the imagined speech and the voice as a ground truth. In our framework, an automatic speech recognition decoder contributed to decomposing the phonemes of the generated speech, demonstrating the potential of voice reconstruction from unseen words. Our results imply the potential of speech synthesis from human EEG signals, not only from spoken speech but also from the brain signals of imagined speech. ... We performed an ablation study of the GRU in the generator and discriminator, and of the GAN, reconstruction, and CTC losses, to verify the effect of each module and loss on model performance. As demonstrated in Table 2, the CER in all ablated cases was mostly inferior to the baseline, indicating that each component performs its role. (A hedged sketch of this GAN + reconstruction + CTC loss combination appears below the table.) |
| Researcher Affiliation | Academia | (1) Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea; (2) Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea |
| Pseudocode | No | The paper describes the architectures of the generator, discriminator, vocoder, and ASR model and their components using text and diagrams (Figure 1 and Figure 2), but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We released the source code and sample data on GitHub at: https://github.com/youngeun1209/NeuroTalk. |
| Open Datasets | No | The primary dataset used in the experiments was collected from six participants as described under 'Experimental Setup'. The paper states: "Since the dataset contains human-derived biosignals, only a small sample dataset could be published to reproduce and execute code." This indicates that the full experimental dataset is not publicly available. |
| Dataset Splits | Yes | The dataset was divided into 5-fold subsets of training, validation, and test data by random selection with a fixed random seed. (See the fold-split sketch below the table.) |
| Hardware Specification | Yes | We trained the model on an NVIDIA GeForce RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions software used for preprocessing (Python and MATLAB, using the OpenBMI toolbox, BBCI toolbox, and EEGLAB) and pre-trained models (HiFi-GAN, HuBERT) but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | The generator had three residual blocks with kernel sizes of 3, 7, and 11, dilations of 1, 3, and 5, and upsampling rates of 3, 2, and 2 with an upsampling kernel size of twice the rate. The initial channel count was 1,024, and the directional GRU dimension was half of the initial channels. The discriminator had the same residual blocks as the generator, but downsampling rates of 3, 3, and 3 with twice the kernel size. The final channel count was 64, and the directional GRU dimension was half of the final channels. The mel-spectrogram was computed at a sampling rate of 22,050 Hz, and the STFT and mel transform used an FFT size of 1,024, a window of 1,024, a hop size of 256, and 80 mel bands. Initial training used a learning rate of 10⁻⁴, and fine-tuning used a lower learning rate of 10⁻⁵, with a maximum of 500 epochs and a batch size of 10. We used the AdamW optimizer (Loshchilov and Hutter 2017) with searched parameters β₁ = 0.8, β₂ = 0.99, and weight decay λ = 0.01, with the learning rate decayed by a factor of 0.999 every epoch. (See the configuration sketch below the table.) |
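
The Research Type row quotes the paper's ablation over the GAN, reconstruction, and CTC losses. Below is a minimal PyTorch sketch of how such a three-term generator objective could be combined; the LSGAN formulation, the L1 reconstruction term, and the loss weights `lambda_recon` and `lambda_ctc` are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_fake, mel_fake, mel_real,
                   asr_log_probs, targets, input_lens, target_lens,
                   lambda_recon=45.0, lambda_ctc=1.0):
    """Hypothetical combination of the three ablated loss terms.

    disc_fake:     discriminator scores for generated mel-spectrograms
    asr_log_probs: (T, N, C) log-probabilities from a frozen ASR decoder
    Loss weights are placeholders, not values from the paper.
    """
    # Least-squares adversarial term (LSGAN form; an assumption).
    adv = torch.mean((disc_fake - 1.0) ** 2)
    # Reconstruction term between generated and ground-truth mels.
    recon = F.l1_loss(mel_fake, mel_real)
    # CTC term pushing the generated speech toward the target phonemes.
    ctc = F.ctc_loss(asr_log_probs, targets, input_lens, target_lens)
    return adv + lambda_recon * recon + lambda_ctc * ctc
```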
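
For the Dataset Splits row, here is a plain illustration of a seeded 5-fold train/validation/test split. The trial count, seed, and validation proportion are placeholders, since the paper specifies only a 5-fold split by random selection with a random seed.

```python
import numpy as np
from sklearn.model_selection import KFold

n_trials = 100  # placeholder trial count; not stated in the quoted text
trials = np.arange(n_trials)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # seed is assumed

for fold, (train_val_idx, test_idx) in enumerate(kfold.split(trials)):
    # Carve a validation set out of the non-test data; its size (one
    # fold's worth) is an assumption, as the paper gives no proportions.
    rng = np.random.default_rng(fold)
    rng.shuffle(train_val_idx)
    n_val = len(test_idx)
    val_idx, train_idx = train_val_idx[:n_val], train_val_idx[n_val:]
    print(f"fold {fold}: {len(train_idx)} train, "
          f"{len(val_idx)} val, {len(test_idx)} test")
```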
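
The Experiment Setup row lists enough hyperparameters to reconstruct the audio front end and optimizer configuration. The sketch below wires the quoted values into standard torchaudio/PyTorch components; the `model` stand-in is hypothetical, and the real generator architecture differs.

```python
import torch
import torchaudio

# Mel-spectrogram front end with the quoted parameters:
# 22,050 Hz sampling, FFT/window size 1,024, hop 256, 80 mel bands.
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, win_length=1024,
    hop_length=256, n_mels=80,
)

model = torch.nn.GRU(80, 512)  # stand-in module, not the paper's generator

# AdamW with the quoted betas and weight decay; lr=1e-4 for initial
# training (the paper later fine-tunes at 1e-5).
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.8, 0.99), weight_decay=0.01
)
# Decay the learning rate by a factor of 0.999 every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

for epoch in range(500):  # maximum of 500 epochs, batch size 10
    ...                   # one epoch of training steps goes here
    scheduler.step()
```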