SenseVoice Is Now Available on EveryScribe — Fast, Multilingual ASR for East Asian Languages

We're excited to announce that SenseVoice, the fast multilingual speech recognition model from Alibaba's FunAudioLLM team, is now the default ASR model in EveryScribe's Private Transcriber. You can start using it today at everyscribe.com, and it runs entirely inside your browser. No cloud upload. No server-side processing. Just your audio and your device.

What Is SenseVoice?

SenseVoice is a speech foundation model built for two things: speed and multilingual accuracy. Developed by Alibaba's FunAudioLLM team, it was trained on more than 400,000 hours of audio data, a large corpus for an openly released ASR model. The result is a model that punches well above its weight class.

The variant we use on EveryScribe is SenseVoice-Small, a non-autoregressive end-to-end model fine-tuned specifically for five languages:

🇨🇳 Chinese (Mandarin)
🇬🇧 English
🇯🇵 Japanese
🇰🇷 Korean
🇭🇰 Cantonese (Yue)

Why SenseVoice Is Remarkably Fast

Many speech recognition models, including Whisper, use autoregressive decoding: they generate one token at a time, and each step depends on the tokens before it. SenseVoice-Small takes a fundamentally different approach, non-autoregressive decoding, in which all output tokens are predicted in a single forward pass.
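
To make the contrast concrete, here is a minimal Python sketch of the two decoding styles. It is a toy illustration, not SenseVoice's or Whisper's actual code: the function names, the blank symbol, and the CTC-style collapse step are our assumptions, chosen because collapsing per-frame predictions is one common way to produce text from a single pass.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # hypothetical "no output" label used by CTC-style decoders
EOS = 9    # hypothetical end-of-sequence token for the autoregressive toy


def autoregressive_decode(
    step: Callable[[List[int]], int], eos: int, max_len: int = 100
) -> Tuple[List[int], int]:
    """Generate tokens one at a time; every call sees the tokens so far."""
    tokens: List[int] = []
    passes = 0
    while len(tokens) < max_len:
        next_token = step(tokens)  # one model call per emitted token
        passes += 1
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens, passes


def ctc_greedy_collapse(frame_ids: Sequence[int]) -> List[int]:
    """Merge consecutive repeated labels, then drop blanks."""
    out: List[int] = []
    prev = None
    for label in frame_ids:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out


def non_autoregressive_decode(
    frame_scores: Sequence[Sequence[float]],
) -> Tuple[List[int], int]:
    """Decode every frame from the output of a single model call."""
    frame_ids = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    return ctc_greedy_collapse(frame_ids), 1


def one_hot(index: int, size: int = 10) -> List[float]:
    """Build a fake score row whose best label is `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Toy stand-ins for real models: both should yield the tokens 5, 7, 7, 3.
target = [5, 7, 7, 3]


def toy_step(prefix: List[int]) -> int:
    return target[len(prefix)] if len(prefix) < len(target) else EOS


# Per-frame labels with repeats and blanks, as a CTC-style model might emit.
toy_frames = [one_hot(label) for label in [5, 5, 0, 7, 0, 7, 3, 3]]

ar_tokens, ar_passes = autoregressive_decode(toy_step, EOS)
nar_tokens, nar_passes = non_autoregressive_decode(toy_frames)

print("autoregressive:", ar_tokens, "model calls:", ar_passes)
print("non-autoregressive:", nar_tokens, "model calls:", nar_passes)
```

In this toy run the autoregressive loop needs five model calls to emit four tokens (the fifth produces the end marker), while the non-autoregressive path needs one. Real models differ in many details, but the number of sequential model calls is the main idea behind the speed gap.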

The practical result, according to the figures the FunAudioLLM team has published: SenseVoice-Small is approximately 7 to 15 times faster than Whisper Large and about 5 times faster than Whisper Small, and it can process 10 seconds of audio in as little as 70 milliseconds. Actual speed in the browser depends on your device.
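
If you want to put the 70-millisecond figure in perspective, the usual yardstick is the real-time factor (RTF): processing time divided by audio duration. The short Python sketch below works through that arithmetic. The one-hour projection is a hypothetical extrapolation that assumes the same RTF holds for long files, which may not be true on your hardware.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in this post: 70 ms of compute for 10 s of audio.
rtf = real_time_factor(0.070, 10.0)
speedup = 1.0 / rtf

# Hypothetical projection: a one-hour recording at the same RTF.
one_hour_estimate = 3600.0 * rtf

print(f"RTF: {rtf:.3f}")
print(f"Speed relative to real time: about {speedup:.0f}x")
print(f"Estimated time for 1 hour of audio: {one_hour_estimate:.1f} s")
```

At that rate the model runs roughly 143 times faster than real time, so an hour of audio would take about 25 seconds. Treat this as a best-case figure rather than what a browser on a modest laptop will achieve.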

For EveryScribe users, this means you get near-instant transcripts even on modest laptops — and since the entire model fits in 250 MB, it loads quickly and doesn't exhaust your device's memory.

Accuracy That Beats Whisper for East Asian Languages

Speed doesn't come at the cost of accuracy. On multilingual speech recognition benchmarks, SenseVoice-Small matches or outperforms Whisper models across all five of its target languages. The gains are especially significant for:

Chinese (Mandarin): SenseVoice reports accuracy improvements of over 50% compared to Whisper on Chinese benchmarks.
Cantonese: One of the most consistently under-served languages in ASR. SenseVoice handles it natively, with dedicated training data and meaningful accuracy gains over general-purpose models.
Japanese and Korean: Clean performance with proper CJK character output, no romanization artifacts.
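
A note on reading percentages like the one above. For Chinese, Japanese, and Korean, ASR accuracy is commonly measured as character error rate (CER), and an "improvement" is often quoted as the relative reduction in that error rate. The figure in this post does not name its metric, so the Python sketch below is only one plausible reading. The sentences and the two model outputs are made up for illustration and are not benchmark data.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_error_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline's errors that the candidate removes."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - candidate) / baseline


# Made-up example: one reference sentence and two hypothetical transcripts.
reference = "今天天气很好"
baseline_output = "今天天汽很号"   # two wrong characters
candidate_output = "今天天气很号"  # one wrong character

baseline_cer = cer(reference, baseline_output)
candidate_cer = cer(reference, candidate_output)
reduction = relative_error_reduction(baseline_cer, candidate_cer)

print(f"baseline CER: {baseline_cer:.3f}")
print(f"candidate CER: {candidate_cer:.3f}")
print(f"relative error reduction: {reduction:.0%}")
```

Here the candidate halves the number of wrong characters, which is a 50% relative reduction even though the absolute CER drops by about 17 percentage points. Keeping the two apart helps when comparing published numbers.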

Beyond Transcription: Emotion and Audio Event Detection

SenseVoice goes further than most ASR models. The underlying architecture also powers:

Speech Emotion Recognition (SER): Detect whether a speaker sounds happy, sad, angry, or neutral. The large variant has achieved state-of-the-art results on nearly every tested emotion recognition dataset — without fine-tuning.
Audio Event Detection (AED): Identify sounds like applause, laughter, coughing, and background music within the audio stream.

These capabilities aren't exposed in the current EveryScribe interface, but they reflect the depth of the model underneath.

Completely Private, Completely Local

When you use SenseVoice on EveryScribe, your audio never leaves your device. The model runs via WebAssembly (WASM) directly inside your browser. There's no server request for your audio data — not to our infrastructure, not to Alibaba's, not to anyone else's. Once the 250 MB model is downloaded and cached in your browser, it works completely offline.

This makes SenseVoice on EveryScribe one of the most privacy-preserving ways to transcribe Mandarin, Japanese, Korean, and Cantonese content available anywhere.

When to Use SenseVoice

SenseVoice is our recommended default for most users. It's the right choice if:

Your audio is in Chinese, Cantonese, Japanese, Korean, or English
You want the fastest possible transcription on your device
You're transcribing meetings, interviews, podcasts, or lectures
Privacy matters — your files shouldn't leave your machine

If you're working with a language outside these five, check out our other models like Dolphin (40 Asian languages), Omnilingual (1,600 languages), or Moonshine (English-only, ultra-light).

Get Started

Open everyscribe.com/dashboard/offline-transcriber, confirm that SenseVoice is selected as your ASR model (it's the default), download it once, and start transcribing. Your files stay on your device.

SenseVoice is open-source and available on GitHub and Hugging Face. We're grateful to the FunAudioLLM team for releasing it under a permissive license that makes projects like EveryScribe possible.