Exploring Speech-to-Speech Models for Adding Emotion

cliff · August 12, 2024, 10:38am

While experimenting with various text-to-speech (TTS) models and using RVC over multiple generations, I noticed a pattern: TTS systems tend to produce either natural-sounding speech with frequent mispronunciations, or speech that is pronounced accurately but sounds robotic. Since tools like RVC make voice cloning possible with any TTS, could a speech-to-speech model be used to infuse emotion into these robotic-sounding TTS outputs?

Sadie · August 22, 2024, 7:07am

Consider taking a look at DeepMind’s WaveNet.

Terryanne · August 22, 2024, 7:08am

I’ve been quite impressed with ElevenLabs speech-to-speech, particularly because it does such an excellent job of mimicking emotional expression, though the result depends heavily on the voice you pick.

I experimented with it for voice acting. https://youtu.be/b2Fd-Oleu2Y?si=HYoA3fmcy9ChfEUe

It is not open source, but it is the best I’ve seen. Udio is fantastic; however, it is designed for singing, and I am not sure whether it can do speech-to-speech. I don’t think it’s what you’re looking for.