The title is dense and the paper is short. But the demo is outstanding: (https://huggingface.co/spaces/aiola/whisper-ner-v1). The sample audio is submitted with "entity labels" set to "football-club, football-player, referee" and WhisperNER returns tags Arsenal and Juventus for the football-club tag. They suggest "personal information" as a tag to try on audio.
Impressive, very impressive. I wonder if it could listen for credit cards or passwords.
It's so great to see that we're finally moving away from the thirty-year-old triple categorization of people, organizations and locations.
This of course means that we now have to think about all the irreconcilable problems of taxonomy, but I'll take that any day over the old version :)
GitHub repo: https://github.com/aiola-lab/whisper-ner
Hugging Face Demo: https://huggingface.co/spaces/aiola/whisper-ner-v1
Pretty good article that focuses on the privacy/security aspect of this — having a single model that does ASR and NER:
https://venturebeat.com/ai/aiola-unveils-open-source-ai-audi...
Wouldn't it be better to run normal Whisper and NER on top of the transcription before streaming a response or writing anything to disk?
What advantage does this offer?
I think one of the biggest advantages is the security/privacy benefit: you can see in the demo that the model can mask entities instead of tagging them. This means that instead of transcribing and then scrubbing sensitive info, you can prevent the sensitive info from ever being transcribed. Another potential benefit is lower latency. The paper doesn't specifically mention latency, but it seems to be on par with normal Whisper, so you save all of the time it would normally take to do entity tagging, which is a big deal for real-time applications.
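To make the contrast concrete, here is a minimal sketch of the cascaded approach from the question (transcribe first, scrub afterwards). Everything in it is an assumption for illustration: `fake_transcribe` is a hypothetical stand-in for an ASR call, and the card-number regex is a crude placeholder for a real NER step, not anything taken from the WhisperNER paper or repo. The only point is that the raw transcript exists unmasked before the scrub runs, which is the exposure a model that masks during decoding would avoid.

```python
import re

# Hypothetical stand-in for an ASR call; a real system would run Whisper here.
def fake_transcribe(audio_id: str) -> str:
    canned = {
        "call-1": "my card number is 4111 1111 1111 1111 and I support Arsenal",
    }
    return canned[audio_id]

# Crude placeholder for the NER step (an assumption, not a real card validator):
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def scrub(text: str, mask: str = "<credit-card>") -> str:
    """Replace anything that looks like a card number with a mask token."""
    return CARD_PATTERN.sub(mask, text)

def cascaded_pipeline(audio_id: str) -> str:
    raw = fake_transcribe(audio_id)  # the sensitive text exists here, unmasked
    return scrub(raw)                # masking only happens in a second step

print(cascaded_pipeline("call-1"))
```

In this sketch, anything that logs, caches, or streams `raw` before `scrub` runs would leak the number; with a single model that emits the mask token directly, there is no such intermediate string to protect.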
Yeah, I’m also curious about that. Does combining ASR and NER into one model improve performance for either?
Almost definitely. You can think of there being a type of triangle inequality for cascading different systems, where manually combined systems almost always perform worse given comparable data and model capacity. Put another way, with a cascade you have tied the model's hands by forcing it to bottleneck through a representation you chose.
Looks like only inference code is available, with no fine-tuning code.
"The model processes audio files and simultaneously applies NER to tag or mask specific types of sensitive information directly within the transcription pipeline. Unlike traditional multi-step systems, which leave data exposed during intermediary processing stages, Whisper-NER eliminates the need for separate ASR and NER tools, reducing vulnerability to breaches."
On a similar note, I have a request for the HN community. Can anyone recommend a low-latency NER model/service?
I'm building an assistant that gives information on local medical providers that match your criteria. I'm struggling with query expansion and entity recognition. For any incoming query, I want to run NER for medical terms (which are limited in scope and pre-determined), and then do query rewriting and expansion.
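Since the terms are limited in scope and pre-determined, one low-latency option worth trying before a model is plain dictionary (gazetteer) matching. Below is a minimal sketch under that assumption; the vocabulary, labels, and function names are all hypothetical, and it will not handle misspellings or paraphrases the way a learned NER model could.

```python
import re

# Hypothetical vocabulary: canonical label -> surface forms users might type.
MEDICAL_TERMS = {
    "cardiology": ["cardiologist", "heart doctor", "cardiology"],
    "dermatology": ["dermatologist", "skin doctor", "dermatology"],
    "pediatrics": ["pediatrician", "pediatrics", "kids doctor"],
}

def build_matcher(terms):
    """Compile one regex over every surface form, plus a form -> label lookup."""
    lookup = {
        form.lower(): label
        for label, forms in terms.items()
        for form in forms
    }
    # Longest forms first so multi-word phrases win over shorter overlaps.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup

def tag_query(query, pattern, lookup):
    """Return (surface text, canonical label, start, end) for each match."""
    return [
        (m.group(0), lookup[m.group(0).lower()], m.start(), m.end())
        for m in pattern.finditer(query)
    ]

pattern, lookup = build_matcher(MEDICAL_TERMS)
tags = tag_query("Need a Heart Doctor or dermatologist near me", pattern, lookup)
print(tags)
```

The canonical labels can then feed the query rewriting and expansion step, for example by swapping each matched span for its label or by appending the label's other surface forms.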