Direct speech-to-speech translation (S2ST) systems leverage recent progress in speech representation learning: the model predicts a sequence of discrete representations (units) derived in a self-supervised manner, and a vocoder synthesizes speech from them. These systems still face two challenges: 1) acoustic multimodality: the discrete units derived from speech with the same content can be non-deterministic because of acoustic properties (e.g., rhythm, pitch, and energy), which degrades translation accuracy; 2) high latency: current S2ST systems use autoregressive models that predict each unit conditioned on the previously generated sequence, failing to take full advantage of parallelism. In this work, we propose TranSpeech, a speech-to-speech translation model with bilateral perturbation. To alleviate the acoustic multimodality problem, we propose bilateral perturbation, which consists of style normalization and information enhancement stages, to learn only the linguistic information from speech samples and generate more deterministic representations. With reduced multimodality, we take a step forward and become the first to establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices and produces high-accuracy results in just a few cycles. Experimental results on three language pairs show improvements of up to 2.5 BLEU points over the best publicly available textless S2ST baseline. Moreover, TranSpeech significantly reduces inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
Audio samples are available at https://TranSpeech.github.io/.
Bilateral Perturbation
Utterances with a single perturbed acoustic condition; the linguistic content remains unchanged.
RR: random resampling. F: a chain function for random pitch shifting.
Text: really interesting work will finally be undertaken on that topic.
Original | Pitch Norm | Energy Norm | F = fs(pr(peq(x))) | F' = peq(fs(pr(x))) | RR |
---|---|---|---|---|---|
Text: an inter ministerial committee on disability was held a few weeks back.
Original | Pitch Norm | Energy Norm | F = fs(pr(peq(x))) | F' = peq(fs(pr(x))) | RR |
---|---|---|---|---|---|
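The column headers above write F and F' as compositions of the same three stages applied in a different order. The sketch below only illustrates that ordering, plus a toy version of random resampling (RR). It is not the authors' implementation: `fs`, `pr`, and `peq` are identity placeholders that record when they are called, and `chain`, `make_stage`, `random_resample`, and the 0.8 to 1.2 factor range are hypothetical names and values chosen for illustration.

```python
import math
import random
from functools import reduce


def chain(*fns):
    """Compose right to left: chain(f, g, h)(x) == f(g(h(x)))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), reversed(fns), x)


def make_stage(name, log):
    """Identity placeholder for a perturbation stage; records its call order."""
    def stage(x):
        log.append(name)
        return x
    return stage


def random_resample(wave, rng, low=0.8, high=1.2):
    """Toy RR: stretch or squeeze a waveform by a random factor (assumed range)
    using linear interpolation. Requires at least two samples."""
    factor = rng.uniform(low, high)
    n_out = max(2, round(len(wave) * factor))
    step = (len(wave) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), len(wave) - 2)
        frac = pos - left
        out.append((1 - frac) * wave[left] + frac * wave[left + 1])
    return out


log = []
fs, pr, peq = (make_stage(n, log) for n in ("fs", "pr", "peq"))

F = chain(fs, pr, peq)        # F  = fs(pr(peq(x)))
F_prime = chain(peq, fs, pr)  # F' = peq(fs(pr(x)))

# 0.1 s of a 220 Hz sine at an assumed 16 kHz sampling rate.
wave = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]

F(wave)
order_F = list(log)
log.clear()
F_prime(wave)
order_F_prime = list(log)

resampled = random_resample(wave, random.Random(0))

print(order_F)        # ['peq', 'pr', 'fs']
print(order_F_prime)  # ['pr', 'fs', 'peq']
print(len(wave), len(resampled))
```

With real perturbations in place of the placeholders, the two chains would generally produce different audio, because the stages do not commute.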
Translation Results
DirectS2ST: baseline model from "Direct speech-to-speech translation with discrete units".
TextlessS2ST: baseline model from "Textless speech-to-speech translation on real data".
TranSpeech: TranSpeech with 5 mask-predict iterations.
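To make "mask-predict iterations" concrete, here is a minimal sketch of the generic mask-predict decoding loop: predict every unit in parallel, then repeatedly re-mask the least confident positions and predict them again. Several parts are assumptions rather than details from this page: the linearly decaying re-masking schedule, the known target length, and `toy_predict`, a hand-written stand-in with a hidden `TARGET` sequence that is not the TranSpeech model.

```python
MASK = -1
WRONG = 0                                # unit the toy predictor guesses without context
TARGET = [12, 7, 7, 31, 4, 18, 9, 25]    # hypothetical unit ids


def toy_predict(tokens):
    """Stand-in for a parallel decoder: returns a unit and a confidence for
    every position. A position is predicted correctly and confidently once at
    least one neighbour is unmasked; without context only even positions are
    guessed correctly, and odd positions get a low-confidence wrong guess."""
    preds, confs = [], []
    for i, gold in enumerate(TARGET):
        neighbours = [tokens[j] for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        known = sum(tok != MASK for tok in neighbours)
        if known > 0:
            preds.append(gold)
            confs.append(0.9)
        elif i % 2 == 0:
            preds.append(gold)
            confs.append(0.6)
        else:
            preds.append(WRONG)
            confs.append(0.2)
    return preds, confs


def mask_predict(predict, length, iterations):
    """Iteratively fill masked positions, then re-mask the least confident ones."""
    tokens = [MASK] * length
    scores = [0.0] * length
    for t in range(iterations):
        preds, confs = predict(tokens)
        # Only currently masked positions are overwritten.
        for i in range(length):
            if tokens[i] == MASK:
                tokens[i], scores[i] = preds[i], confs[i]
        # Linearly decaying number of positions to re-mask (assumed schedule).
        n_mask = length * (iterations - t - 1) // iterations
        if n_mask == 0:
            break
        worst = sorted(range(length), key=lambda i: scores[i])[:n_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens


one_pass = mask_predict(toy_predict, len(TARGET), 1)
five_passes = mask_predict(toy_predict, len(TARGET), 5)
print(one_pass)               # [12, 0, 7, 0, 4, 0, 9, 0]
print(five_passes == TARGET)  # True
```

In this toy setting a single parallel pass leaves four wrong units, while five iterations recover the full sequence. Each iteration still predicts all masked positions at once, which is where the latency advantage over autoregressive decoding comes from.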
En-Es
Row | Source (English), ground truth | Target (Spanish), ground truth | DirectS2ST prediction | TextlessS2ST prediction | TranSpeech prediction |
---|---|---|---|---|---|
Sample 1: | | | | | |
Reference: | it can be found in algeria lebanon portugal and spain | se encuentra en argelia líbano portugal y españa | | | |
ASR: | | | se encuentra en argeria libanía portugal y españa | se encuentra en argelia livana portugal y españa | se encuentra en argelia líbano portugal y españa |
Sample 2: | | | | | |
Reference: | afterwards they performed exhibitions in russia finland greece and poland | posteriormente realizó exposiciones en rusia finlandia grecia y polonia | | | |
ASR: | | | o durante realidades exposiciones en rusia filandia finlandia y polonia | posteriormente realizaron exposiciones en rusia finlandia grecia y polonia | posteriormente realizó exposiciones en rusia finlandia grecia y poloni |
Sample 3: | | | | | |
Reference: | they are annual or perennial plants alpine and herbaceous | son plantas anuales o perennes alpinas y herbáceas | | | |
ASR: | | | son plantas anuales o perennes y herbáceas | son plantas o plantas o perennes y rbáceas herbáceas | son plantas anuales o perennes alpinas y herbáceas |
En-Fr
Row | Source (English), ground truth | Target (French), ground truth | DirectS2ST prediction | TextlessS2ST prediction | TranSpeech prediction |
---|---|---|---|---|---|
Sample 1: | | | | | |
Reference: | the town includes several villages and hamlets | la commune comprend plusieurs villages et hameaux | | | |
ASR: | | | la commune comporte de nombreux villages et hameaux | la commune comprend plusieurs villages et hameaux | la commune comprend plusieurs villages et hameaux |
Sample 2: | | | | | |
Reference: | the dolmen is classified as a national monument by the irish state | le dolmen est classé comme monument national par l'état irlandais | | | |
ASR: | | | le dolman est classé comme un monument national par l'état elandes | le domaine est classé comme monument national par l'état ilandais | le dolmen est classé comme monument national par l'état |
Sample 3: | | | | | |
Reference: | what is the committee's opinion favorable opinion on this precision | quel est l'avis de la commission avis favorable à cette précision | | | |
ASR: | | | quel est l'avis de la commission avis favorable de cette précision | quel est l'avis de la commission avis favorable à cette précision | quel est l'avis de la commission avis favorable à cette précision |
Fr-En
Row | Source (French), ground truth | Target (English), ground truth | DirectS2ST prediction | TextlessS2ST prediction | TranSpeech prediction |
---|---|---|---|---|---|
Sample 1: | | | | | |
Reference: | un membre de la commission municipale du gouvernement provisoire | a member of the municipal commission of the provisional government | | | |
ASR: | | | amint solimin ici pal commission of the provision | a member of the government commission the government takes place | a member of the municipal commission of the provisional government |
Sample 2: | | | | | |
Reference: | dans le nord de la france le terme correspondant est panne | in the north of france the corresponding term is penne | | | |
ASR: | | | in the north of france the corresponding term is north of france | in the north of france the term corresponding is | in the north of france the corresponding term is north of france |
Sample 3: | | | | | |
Reference: | le château appartient à la famille noble de soirier | the castle belongs to the noble de soirier family | | | |
ASR: | | | the castle belongs to the noble family of | the castle belongs to the noble desire family | the castle belongs to the noble desire family |