This is the demo page for the paper Defending Your Voice: Adversarial Attack on Voice Conversion by Chien-yu Huang, Yist Y. Lin, Hung-yi Lee and Lin-shan Lee.

Abstract

Substantial improvements have been achieved in recent years in voice conversion, which converts the speaker characteristics of an utterance into those of another speaker without changing the linguistic content of the utterance. Nonetheless, the improved conversion technologies have also led to concerns about privacy and authentication. It is therefore highly desirable to prevent one's voice from being improperly utilized by such voice conversion technologies. This is why we report in this paper the first known attempt to perform an adversarial attack on voice conversion. We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended. Given these adversarial examples, voice conversion models cannot convert other utterances so that they sound as if produced by the defended speaker. Preliminary experiments were conducted on two state-of-the-art zero-shot voice conversion models. Objective and subjective evaluation results in both white-box and black-box scenarios are reported. The speaker characteristics of the converted utterances were made obviously different from those of the defended speaker, while the adversarial examples of the defended speaker remained indistinguishable from the authentic utterances.

[Figure: Voice conversion diagram]

[Figure: Methodologies]

Demo Cases

Each of the following demo cases involves two speakers. The utterance from the first speaker provides the linguistic content, and the voice conversion model extracts the speaker characteristics from the second speaker's utterance. The conversion result combines the content of the first speaker's utterance with the speaker characteristics of the second speaker.
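As a rough illustration of this pipeline, here is a minimal sketch of a zero-shot voice conversion forward pass built from a content encoder, a speaker encoder, and a decoder. The module and function names are hypothetical and do not correspond to the specific models attacked in the paper.

```python
import torch
import torch.nn as nn


class ZeroShotVC(nn.Module):
    """Skeleton of a zero-shot voice conversion model (hypothetical modules)."""

    def __init__(self, content_encoder: nn.Module, speaker_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.content_encoder = content_encoder   # extracts linguistic content
        self.speaker_encoder = speaker_encoder   # extracts speaker characteristics
        self.decoder = decoder                   # recombines content with a speaker embedding

    def forward(self, content_utt: torch.Tensor, speaker_utt: torch.Tensor) -> torch.Tensor:
        content = self.content_encoder(content_utt)   # from the first speaker's utterance
        spk_emb = self.speaker_encoder(speaker_utt)   # from the second speaker's utterance
        return self.decoder(content, spk_emb)         # converted utterance
```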

This demo aims to show that, with a small perturbation (\( \epsilon \)) applied to the second speaker's utterance, the voice conversion model fails to generate a result with the speaker characteristics of the second speaker, which means that his/her voice is successfully defended.
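For intuition, the following is a minimal sketch of how such an \( \epsilon \)-bounded perturbation could be generated with a PGD-style update that pushes the speaker embedding of the perturbed utterance away from that of the original utterance. This is an illustrative assumption only: the loss, the step sizes, and the function names are hypothetical and are not necessarily the objective used in the paper.

```python
import torch
import torch.nn.functional as F


def defend_speaker_utterance(speaker_utt: torch.Tensor,
                             speaker_encoder: torch.nn.Module,
                             epsilon: float = 0.05,
                             steps: int = 10,
                             step_size: float = 0.01) -> torch.Tensor:
    """Add an L-infinity-bounded perturbation (|delta| <= epsilon) to the speaker
    utterance so that its speaker embedding moves away from the original embedding.
    Hypothetical PGD-style sketch, not the exact objective used in the paper."""
    original_emb = speaker_encoder(speaker_utt).detach()
    delta = torch.zeros_like(speaker_utt, requires_grad=True)

    for _ in range(steps):
        adv_emb = speaker_encoder(speaker_utt + delta)
        # Minimizing the negative distance maximizes the embedding distance.
        loss = -F.mse_loss(adv_emb, original_emb)
        loss.backward()
        with torch.no_grad():
            delta -= step_size * delta.grad.sign()   # gradient step that increases the distance
            delta.clamp_(-epsilon, epsilon)          # keep the perturbation within epsilon
        delta.grad.zero_()

    return (speaker_utt + delta).detach()
```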

In the following demos (all utterances are from the CSTR VCTK Corpus), pXXX denotes a speaker identity and pXXX_YYY denotes one of that speaker's utterances, e.g. speaker p225 and utterance p225_001.

Demo Case #1

Content utterance: p374_249

Speaker utterance                   Conversion result
p333_334 (original)                 (audio)
p333_334 (\( \epsilon = 0.05 \))    (audio)
p333_334 (\( \epsilon = 0.1 \))     (audio)

Demo Case #2

Content utterance: p302_133

Speaker utterance                   Conversion result
p287_286 (original)                 (audio)
p287_286 (\( \epsilon = 0.05 \))    (audio)
p287_286 (\( \epsilon = 0.1 \))     (audio)