Coqui's XTTSv2 is good for this because it has a streaming mode. I have my own version of this where I got ~500ms end-to-end response latency, which is much faster than any other open source project I've seen. https://github.com/jdarpinian/chirpy
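For readers wondering why a streaming mode matters for latency: one common pattern is to cut the LLM's token stream into sentences and hand each sentence to the TTS as soon as it is complete, instead of waiting for the whole reply. Below is a minimal sketch of that idea. It is not taken from chirpy or from Coqui's code; the function name and the naive sentence-boundary regex are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Naive boundary: sentence punctuation followed by whitespace (an assumption;
# it will mis-split abbreviations such as "e.g. ").
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A single token may complete more than one sentence, so keep draining.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            end = match.end()
            yield buffer[:end].strip()
            buffer = buffer[end:]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    # Hypothetical LLM output arriving in arbitrary token pieces.
    llm_tokens = ["Hel", "lo the", "re. How", " are you", "? Fine"]
    for sentence in sentences_from_tokens(llm_tokens):
        # In a real pipeline this is where the sentence would go to the TTS.
        print(sentence)
```

With this shape, speech can start once the first sentence is complete, so the time to first audio depends on the length of that sentence rather than on the length of the whole reply.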
These are easy to make and fun to play with, and it's awesome to have everything local. But it will take more to build something truly usable. A truly natural conversational AI needs to understand the nuances of conversation, most importantly when to speak and when to wait. It also needs to know subtleties of the user's voice that no speech recognizer can output, and it needs more precise control over the output voice than any TTS provides. Audio-to-audio models in the style of GPT-4o are clearly the way forward. (And someday soon, video-to-video models for video calling with a virtual avatar. The step after that is robotics for physical avatars.)
There aren't any open source audio-to-audio models yet but there are some promising approaches. https://ultravox.ai has the input half at least. https://tincans.ai/slm has a cool approach too.
> There aren't any open source audio-to-audio models yet
I don't think that's true. See, for example, https://huggingface.co/facebook/seamless-m4t-v2-large which isn't general purpose like GPT-4o, but translation still seems pretty useful.
I don't think SeamlessM4T qualifies as an end-to-end audio-to-audio model. The paper states "the task of speech-to-speech translation in SeamlessM4T v2 is broken down into speech-to-text translation (S2TT) and then text-to-unit conversion (T2U)". And while language translation is an important application, as you mention, the model is strictly limited to that. It wouldn't understand or produce non-speech audio (e.g. singing, music, environmental sounds, etc.) and you can't have a conversation with it.
I tried a similar project out last week, which uses Ollama, FastWhisperAPI, and MeloTTS: https://github.com/PromtEngineer/Verbi
Docker is a great option if you want lots of people to try out your project, but not many apps in this space come with a Dockerfile.
Ok, I need this but cloning Majel Barrett as the voice of the Enterprise computer.
Trivially done with a minute-long wav file. Simply specify the source sample in your june-va config.json
We have made an open source orchestration layer that lets you plug in your own TTS/ASR/LLM for end-to-end voice conversations: https://github.com/bolna-ai/bolna
We are also working on a complete open source stack for ASR+TTS+LLM and will be releasing it shortly.
Have you thought about support for the wyoming protocol? That would make it pretty much plug&play with home assistant.
Hadn't heard of the Wyoming Protocol before, but it's interesting, thanks for mentioning
For others who also hadn't heard of it, here's an overview: https://github.com/rhasspy/rhasspy3/blob/master/docs/wyoming...
As mentioned a few days ago, we just released the end-to-end open source stack at https://github.com/bolna-ai/bolna/tree/master/examples/whisp...
Honestly, there are so many projects on GitHub doing STT - LLM - TTS that I've lost count. The only things that feel like magic are STT with Voice Activity Detection and low-latency LLM inference on Groq, which together make conversations feel natural.
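To make the Voice Activity Detection point concrete, here is a minimal sketch of end-of-turn detection using frame energy plus a short "hangover" of silence. It is only an illustration of the idea: real projects typically use a trained VAD rather than a fixed energy threshold, and the function names, the threshold, and the hangover length are all assumptions, not values from any project in this thread.

```python
import math
from typing import Iterable, Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def end_of_turn_index(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.02,   # assumed energy threshold, needs tuning per mic
    hangover_frames: int = 3,  # assumed number of silent frames that end a turn
) -> Optional[int]:
    """Return the index of the frame where the user's turn is judged finished.

    A turn ends after speech has been heard and is followed by
    `hangover_frames` consecutive frames below `threshold`.
    Returns None if that never happens.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None


if __name__ == "__main__":
    silence = [0.0] * 160
    speech = [0.5] * 160
    stream = [silence, speech, speech, silence, speech,
              silence, silence, silence, silence]
    print(end_of_turn_index(stream))
```

The hangover is the knob that trades responsiveness against interruptions: a shorter one makes the assistant answer sooner but also makes it more likely to cut in during a natural pause.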
What we have learnt is that big enterprises do not really want to use closed-source models, because random bursts in usage can run up their bills.
Today we released our full open source, end-to-end, dockerized ASR+LLM+TTS stack:
https://news.ycombinator.com/item?id=40789200
I currently use Ollama + Openwebui for this. It also has a really serviceable voice mode. And it has many options like RAG integrations, custom models, memories to know you better, vision, a great web interface etc. But I'll have a look at this thing.
Looks interesting! Is the latency low enough for it to feel natural? How's the Coqui speech quality?
It supports XTTSv2 which is currently the open-weight state of the art. So, pretty damn good (https://huggingface.co/coqui/XTTS-v2/blob/main/samples/en_sa...).
Too bad that the project is in limbo after Coqui (the company) folded. The license limits the use of the weights to non-commercial usage unless you buy a commercial license, and there's nobody left to sell you one now.
Is there anyone left to sue you? (How does that work?)
I don't know the details in this case, but it seems plausible that someone still owns the IP and thus might be in a position to initiate legal proceedings.
That same person could update the license, no?
Honestly, I don't think that sounds as human as Piper does, but that's probably a function of the voice model files more than anything. E.g., en_US 'amy' sounds artificial, but hfc_female sounds more realistic in the Piper samples.
https://rhasspy.github.io/piper-samples/
When I gave "Matt", my loyal local assistant[1], a voice, XTTSv2 performed better on long-form text. In long-form output the emotions seemed well balanced, but in short replies the emotion patterns frequently felt off and therefore unnatural. What I liked about XTTSv2, though, is that voice cloning is fairly easy: you just provide a .wav file with the intended voice pattern.
[1]https://open.substack.com/pub/jdsemrau/p/teaching-your-agent...
xTTS is notoriously bad at generating short samples. It will also hallucinate if you give it something short enough.
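One possible workaround, sketched here as an assumption rather than anything XTTS itself documents, is to merge very short text segments into longer ones before sending them to the TTS. The function name and the 40-character minimum are made up for illustration and would need tuning by ear.

```python
from typing import List


def merge_short_segments(segments: List[str], min_chars: int = 40) -> List[str]:
    """Merge consecutive segments until each chunk has at least `min_chars`.

    A short leftover at the end is appended to the previous chunk so that
    no tiny fragment is synthesized on its own (unless it is all there is).
    """
    merged: List[str] = []
    pending = ""
    for segment in segments:
        pending = f"{pending} {segment}".strip()
        if len(pending) >= min_chars:
            merged.append(pending)
            pending = ""
    if pending:
        if merged:
            merged[-1] = f"{merged[-1]} {pending}"
        else:
            merged.append(pending)
    return merged


if __name__ == "__main__":
    reply = ["Yes.", "Sure thing.",
             "I can set a timer for ten minutes if you like.", "Ok."]
    for chunk in merge_short_segments(reply):
        print(chunk)
```

The trade-off is latency: waiting for enough text to fill a chunk delays the first audio, which works against the sentence-by-sentence streaming discussed earlier in the thread.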
How does the STT compare to Fastwhisper?
How many GB of RAM does the model require?
My very first multimodal AI star on GitHub. Hope we see more of these in the future.
How long until there's a standalone OS that treats AI usage as a first-class citizen?