Audio & Attention Map Demo

FRAGMENTVC: ANY-TO-ANY VOICE CONVERSION BY END-TO-END EXTRACTING AND FUSING FINE-GRAINED VOICE FRAGMENTS WITH ATTENTION


Abstract: Any-to-any voice conversion aims to convert the voice from and to any speaker, even speakers unseen during training. This is much more challenging than one-to-one or many-to-many conversion, but much more attractive in real-world scenarios. In this paper we propose FragmentVC, in which the latent phonetic structure of the source speaker's utterance is obtained from Wav2Vec 2.0, while the spectral features of the target speaker's utterance(s) are obtained from log mel-spectrograms. By aligning the hidden structures of the two feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker's utterance(s) and fuse them into the desired utterance. The whole process is end-to-end and based on the attention mechanism of the Transformer, as verified by analysis of attention maps. The approach is trained with a reconstruction loss only, without any consideration of disentangling content and speaker information, and does not require parallel data. Objective evaluation based on speaker verification and subjective evaluation with MOS both show that this approach outperforms state-of-the-art (SOTA) approaches such as AdaIN-VC and AutoVC.
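The fusing step described in the abstract rests on attention between source-side and target-side features. The following is a minimal sketch of generic scaled dot-product cross-attention, not the FragmentVC implementation: it assumes queries stand in for source features (content side) and keys and values stand in for target features (speaker side). The function names and the toy vectors are hypothetical, and a real model would use learned projections and many more dimensions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention (hypothetical helper for illustration).

    queries: one vector per source frame (content side)
    keys, values: one vector per target frame (speaker side)
    Returns (fused, weights); weights[i][j] is how strongly source frame i
    draws on target frame j, i.e. one row of an attention map.
    """
    scale = math.sqrt(len(keys[0]))
    value_dim = len(values[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this source frame to every target frame.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of target values: the "fused" output frame.
        fused.append(
            [sum(w * v[d] for w, v in zip(row, values)) for d in range(value_dim)]
        )
    return fused, weights

# Toy data (assumed for illustration): 2 source frames, 3 target frames.
source_queries = [[1.0, 0.0], [0.0, 1.0]]
target_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
target_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused, attention_map = cross_attention(source_queries, target_keys, target_values)
```

In this toy setup the first source frame attends mostly to the first target frame, so the first row of `attention_map` peaks in its first column. Each row sums to 1, which is what makes the rows readable as an attention map over target frames.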

GitHub (Source Code) | arXiv (Preprint)

Seen-to-seen conversion

This section presents 4 conversion pairs, each containing 4 speech utterances. The first 2 utterances are drawn from the CSTR VCTK Corpus:
  1. An utterance from the source speaker, termed the source utterance
  2. An utterance from the target speaker with the same transcription as the source utterance, termed the authentic utterance
The remaining 2 utterances are conversion results that use the source utterance as the source:
  1. A synthetic utterance generated with the authentic utterance as the target
  2. A synthetic utterance generated with 5 randomly sampled utterances from the target speaker as the target
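The second conversion setting above draws several target utterances at random. A minimal sketch of that selection step follows, assuming the corpus is available as a mapping from speaker ID to a list of utterance files. The helper name, the mapping, and the file names are hypothetical and are not taken from the FragmentVC code.

```python
import random

def pick_target_utterances(utterances_by_speaker, target_speaker, k, seed=None):
    """Randomly pick k distinct utterances of the target speaker.

    Hypothetical helper: `utterances_by_speaker` maps a speaker ID to a
    list of utterance identifiers (for example, file names).
    """
    pool = utterances_by_speaker[target_speaker]
    if k > len(pool):
        raise ValueError(f"speaker {target_speaker} has only {len(pool)} utterances")
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    return rng.sample(pool, k)  # sampling without replacement

# Hypothetical file names for illustration only.
corpus = {"p227": [f"p227_{i:03d}.wav" for i in range(1, 21)]}
targets = pick_target_utterances(corpus, "p227", k=5, seed=0)
```

Sampling without replacement keeps the 5 target utterances distinct. Passing a seed is a design choice for reproducible demos and can be dropped when fresh samples are wanted.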

Pair 1
Source speaker: p225
Target speaker: p227
Transcription: “Ask her to bring these things with her from the store.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Pair 2
Source speaker: p227
Target speaker: p225
Transcription: “Many complicated ideas about the rainbow have been formed.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Pair 3
Source speaker: p228
Target speaker: p232
Transcription: “Many complicated ideas about the rainbow have been formed.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Pair 4
Source speaker: p232
Target speaker: p228
Transcription: “Ask her to bring these things with her from the store.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Unseen-to-unseen conversion

This section presents 4 conversion pairs, each containing 5 speech utterances. The first 2 utterances are drawn from the CMU Arctic dataset:
  1. An utterance from the source speaker, termed the source utterance
  2. An utterance from the target speaker with the same transcription as the source utterance, termed the authentic utterance

The remaining 3 utterances are conversion results that use the source utterance as the source: one from FragmentVC, generated with 10 randomly sampled utterances from the target speaker as the target, and one each from AdaIN and AutoVC for comparison.

Pair 1
Source speaker: slt
Target speaker: lnh
Transcription: “The Warden with a quart of champagne.”
Source utterance
Authentic utterance from the target speaker
Conversion results:
  FragmentVC, with 10 randomly sampled target utterances
  AdaIN
  AutoVC

Pair 2
Source speaker: clb
Target speaker: rms
Transcription: “The scents of strange vegetation blew off the tropic land.”
Source utterance
Authentic utterance from the target speaker
Conversion results:
  FragmentVC, with 10 randomly sampled target utterances
  AdaIN
  AutoVC

Pair 3
Source speaker: bdl
Target speaker: ljm
Transcription: “The woman in you is only incidental, accidental, and irrelevant.”
Source utterance
Authentic utterance from the target speaker
Conversion results:
  FragmentVC, with 10 randomly sampled target utterances
  AdaIN
  AutoVC

Pair 4
Source speaker: rms
Target speaker: bdl
Transcription: “Bassett was a fastidious man.”
Source utterance
Authentic utterance from the target speaker
Conversion results:
  FragmentVC, with 10 randomly sampled target utterances
  AdaIN
  AutoVC

© 台大語音實驗室 NTU Speech Lab