Dysarthric speech reconstruction (DSR) aims to enhance the intelligibility of dysarthric speech. Compared with normal speech, dysarthric speech is characterized by pathological features, including discontinuous pronunciation, slow speaking rate, hoarseness, and improper pauses. The resulting disparity in feature space between normal and dysarthric speech can lead to suboptimal reconstruction and degraded intelligibility. To strengthen reconstruction in the speech feature space, this paper proposes a DSR model named the Encoding-Aligned Variational Autoencoder (EA-VAE). By incorporating alignment modules for frame-level embedding features, prior distributions, and duration into the encoder of the VAE, the model explicitly aligns the dysarthric speech encoding with the representation of parallel normal speech. A shared decoder then generates speech with improved intelligibility. Experimental results on the UASpeech benchmark confirm that EA-VAE achieves state-of-the-art performance, with a 31.7% relative word error rate reduction and the highest subjective MOS score (4.48), validating the effectiveness of the proposed method for dysarthric speech reconstruction.
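The three encoder-side alignment objectives described above (frame-level embeddings, prior distributions, and duration) can be sketched as a combined training loss. This is a minimal NumPy illustration, not the authors' implementation: the function and field names (`ea_vae_alignment_loss`, `emb`, `mu`, `logvar`, `dur`) and the choice of MSE / Gaussian KL / L1 terms are assumptions about how such alignment is typically formulated.

```python
import numpy as np

def kl_gaussian(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), averaged over dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.mean(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def ea_vae_alignment_loss(dys, nrm, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined alignment objective for parallel utterances.

    dys / nrm: dicts with
      "emb"    - frame-level encoder embeddings, shape (T, D)
      "mu"     - posterior/prior mean, shape (D,)
      "logvar" - posterior/prior log-variance, shape (D,)
      "dur"    - per-unit durations, shape (T,)
    Assumes frames/durations are already paired (e.g. via forced alignment).
    """
    # 1) frame-level embedding alignment (MSE between paired frames)
    l_emb = np.mean((dys["emb"] - nrm["emb"]) ** 2)
    # 2) prior-distribution alignment (KL between the two Gaussians)
    l_prior = kl_gaussian(dys["mu"], dys["logvar"], nrm["mu"], nrm["logvar"])
    # 3) duration alignment (L1 between paired durations)
    l_dur = np.mean(np.abs(dys["dur"] - nrm["dur"]))
    w_emb, w_prior, w_dur = weights
    return w_emb * l_emb + w_prior * l_prior + w_dur * l_dur
```

When the dysarthric encoding exactly matches the normal-speech representation, all three terms vanish, so the loss pushes the encoder toward the normal-speech feature space that the shared decoder consumes.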
