Reference: Healthy reference speech of control speaker CF02.
ASR-TTS: Cascaded DSR pipeline with a HuBERT-CTC ASR model and a Tacotron 2 TTS model, followed by a HiFi-GAN vocoder.
E2E-DSR: End-to-end voice conversion via cross-modal knowledge distillation for dysarthric speech reconstruction.
ASA-DSR: Speaker identity preservation in dysarthric speech reconstruction by adversarial speaker adaptation, with an ASR model fine-tuned on dysarthric speech as the content encoder.
Unit-DSR (proposed): A speech unit-based dysarthric speech reconstruction system, which is efficiently fine-tuned from a pre-trained HuBERT backbone using a multi-stage strategy. A unit HiFi-GAN vocoder is utilized for speech generation.
Diagram and example of Unit-DSR system
Fig. 1. (a) Diagram of the Unit-DSR system; (b) An example of original speech units of different speakers uttering 'bath', and the reconstructed norm units from the speech unit normalizer, which have a high correspondence with the reference speech units.
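One way to quantify the "correspondence" between normalized units and reference units is sketched below. This is an illustration only, not the metric used by the authors: it assumes speech units are integer IDs, merges consecutive repeated units, and computes a length-normalized edit distance. The function names and the unit IDs in the example are made up.

```python
from itertools import groupby


def collapse_repeats(units):
    """Merge runs of identical consecutive units, e.g. [5, 5, 9] -> [5, 9]."""
    return [unit for unit, _ in groupby(units)]


def edit_distance(a, b):
    """Levenshtein distance between two unit sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def unit_error_rate(hyp_units, ref_units):
    """Edit distance between collapsed sequences, divided by reference length."""
    hyp = collapse_repeats(hyp_units)
    ref = collapse_repeats(ref_units)
    if not ref:
        raise ValueError("reference unit sequence is empty")
    return edit_distance(hyp, ref) / len(ref)


# Made-up unit IDs, for illustration only (not taken from Fig. 1).
reference = [12, 12, 47, 47, 47, 8, 8]
dysarthric = [12, 30, 30, 47, 5, 5, 8]
normalized = [12, 12, 12, 47, 47, 8]

print(unit_error_rate(dysarthric, reference))
print(unit_error_rate(normalized, reference))
```

A lower value indicates that a unit sequence is closer to the reference. In this toy example the normalized sequence matches the reference after repeats are merged, while the dysarthric sequence contains two extra units.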
Dysarthric speech reconstruction for different speakers
Four dysarthric speakers with different levels of speech intelligibility are used in the experiments: M05 (mid), F04 (mid), M07 (low), and F02 (low). 'F' and 'M' denote female and male, respectively.
By replacing the target speaker code, the proposed multi-speaker unit HiFi-GAN vocoder can generate high-quality speech with different speaker identities. On this demo page, the target speaker of the unit vocoder is CF02.
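The idea of swapping the speaker code can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the vocoder input is built by concatenating a per-unit embedding with a fixed speaker code on every frame. The tables, their values, the speaker label `SPK_B`, and the function `build_vocoder_inputs` are all hypothetical.

```python
# Hypothetical lookup tables; all values are illustrative only.
UNIT_TABLE = {
    12: [0.1, 0.2, 0.3],
    47: [0.4, 0.5, 0.6],
    8: [0.7, 0.8, 0.9],
}
SPEAKER_CODES = {
    "CF02": [1.0, 0.0],   # target speaker named on the demo page
    "SPK_B": [0.0, 1.0],  # hypothetical alternative target speaker
}


def build_vocoder_inputs(units, speaker_id, unit_table, speaker_codes):
    """Concatenate each unit embedding with the chosen speaker code."""
    code = speaker_codes[speaker_id]
    return [unit_table[unit] + code for unit in units]


units = [12, 47, 8]
frames_cf02 = build_vocoder_inputs(units, "CF02", UNIT_TABLE, SPEAKER_CODES)
frames_spk_b = build_vocoder_inputs(units, "SPK_B", UNIT_TABLE, SPEAKER_CODES)

# The content part of each frame is identical; only the speaker part changes.
for frame_a, frame_b in zip(frames_cf02, frames_spk_b):
    print(frame_a[:3] == frame_b[:3], frame_a[3:], frame_b[3:])
```

Under this assumption, the unit sequence carries the linguistic content and the speaker code alone selects the output voice, which is why changing the code changes the speaker identity without altering the reconstructed content.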