Audio & Attention Map Demo

FRAGMENTVC: ANY-TO-ANY VOICE CONVERSION BY END-TO-END EXTRACTING AND FUSING FINE-GRAINED VOICE FRAGMENTS WITH ATTENTION

Abstract: Any-to-any voice conversion aims to convert the voice from and to any speaker, even those unseen during training. This is much more challenging than one-to-one or many-to-many tasks, but much more attractive in real-world scenarios. In this paper we propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms. By aligning the hidden structures of the two different feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance. All of this is based on the attention mechanism of the Transformer, as verified by analysis of attention maps, and is accomplished end-to-end. The approach is trained with a reconstruction loss only, without any disentanglement considerations between content and speaker information, and does not require parallel data. Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach outperformed SOTA approaches such as AdaIN-VC and AutoVC.
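To make the fusing step in the abstract more concrete, here is a minimal sketch of single-head scaled dot-product cross-attention, in which each source frame attends over the target speaker's frames and receives a weighted mix of them. This is an illustration under stated assumptions, not the FragmentVC implementation: the function names, the feature dimensions, the single attention head, and the random vectors standing in for Wav2Vec 2.0 features and log mel-spectrogram frames are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (hypothetical helper).

    queries: one vector per source frame (content side).
    keys, values: one vector per target frame (speaker side).
    Returns the fused frames and the attention map (source x target).
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused, attention_map = [], []
    for q in queries:
        # Similarity of this source frame to every target frame.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attention_map.append(weights)
        # Weighted sum of target values: the "fused" output frame.
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return fused, attention_map


# Toy stand-ins (assumptions): random vectors instead of real features.
rng = random.Random(0)
source_frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
target_keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]
target_values = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(6)]

fused, attention_map = cross_attention(source_frames, target_keys, target_values)
```

Each row of `attention_map` is a distribution over target frames for one source frame, which is the kind of quantity an attention map plot visualizes.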

GitHub (Source Code)
arXiv (Preprint)

Seen-to-seen conversion

In the following sections, there are 4 conversion pairs, each containing 4 speech utterances. The first 2 utterances are drawn from the CSTR VCTK Corpus:

An utterance from the source speaker, termed the source utterance
An utterance from the target speaker with the same transcription as the source utterance, termed the authentic utterance

The remaining 2 utterances are conversion results that use the source utterance as the source:

A synthetic utterance generated with the authentic utterance as target
A synthetic utterance generated with 5 randomly sampled utterances from the target speaker as target
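The two target conditions above can be sketched as a small selection step. This is only an illustration: `build_conditions`, the utterance IDs, and the fixed seed are hypothetical, and the page does not say whether the authentic utterance is excluded from the random pool, so this sketch samples from the whole pool.

```python
import random


def build_conditions(target_utterances, authentic_id, n_random=5, seed=0):
    """Return the target utterance IDs used for the two conversion conditions.

    Hypothetical helper: one condition uses only the authentic utterance,
    the other uses n_random utterances sampled without replacement.
    """
    if authentic_id not in target_utterances:
        raise ValueError("authentic utterance must belong to the target speaker")
    if n_random > len(target_utterances):
        raise ValueError("not enough target utterances to sample from")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return {
        "authentic": [authentic_id],
        "random": rng.sample(target_utterances, n_random),
    }


# Hypothetical utterance IDs for one target speaker.
pool = [f"utt_{i:03d}" for i in range(1, 21)]
conditions = build_conditions(pool, authentic_id="utt_003")
```

The same helper would cover the unseen-to-unseen setting by passing `n_random=10`.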

Pair 1

Source speaker: p225
Target speaker: p227
Transcription: “Ask her to bring these things with her from the store.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Pair 2

Source speaker: p227
Target speaker: p225
Transcription: “Many complicated ideas about the rainbow have been formed.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Pair 3

Source speaker: p228
Target speaker: p232
Transcription: “Many complicated ideas about the rainbow have been formed.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Pair 4

Source speaker: p232
Target speaker: p228
Transcription: “Ask her to bring these things with her from the store.”
Source utterance
Authentic utterance from the target speaker
Conversion result with the authentic utterance as target
Conversion result with 5 randomly sampled target utterances

Unseen-to-unseen conversion

In the following sections, there are 4 conversion pairs, each containing 5 speech utterances. The first 2 utterances are drawn from the CMU Arctic dataset:

An utterance from the source speaker, termed the source utterance
An utterance from the target speaker with the same transcription as the source utterance, termed the authentic utterance

The remaining utterances are conversion results generated with the source utterance as source: one from FragmentVC, using 10 randomly sampled utterances from the target speaker as target, followed by results from AdaIN and AutoVC for comparison.

Pair 1

Source speaker: slt
Target speaker: lnh
Transcription: “The Warden with a quart of champagne.”
Source utterance
Authentic utterance from the target speaker
Conversion results
FragmentVC with 10 randomly sampled target utterances
AdaIN
AutoVC

Pair 2

Source speaker: clb
Target speaker: rms
Transcription: “The scents of strange vegetation blew off the tropic land.”
Source utterance
Authentic utterance from the target speaker
Conversion results
FragmentVC with 10 randomly sampled target utterances
AdaIN
AutoVC

Pair 3

Source speaker: bdl
Target speaker: ljm
Transcription: “The woman in you is only incidental, accidental, and irrelevant.”
Source utterance
Authentic utterance from the target speaker
Conversion results
FragmentVC with 10 randomly sampled target utterances
AdaIN
AutoVC

Pair 4

Source speaker: rms
Target speaker: bdl
Transcription: “Bassett was a fastidious man.”
Source utterance
Authentic utterance from the target speaker
Conversion results
FragmentVC with 10 randomly sampled target utterances
AdaIN
AutoVC