Dolphin is now available in EveryScribe's Private Transcriber. At just 238 MB, it covers more linguistic ground than almost any ASR model of its size — 40 Eastern languages spanning East Asia, South Asia, Southeast Asia, and the Middle East, plus 22 Chinese dialects that most speech recognition systems simply ignore.
What Is Dolphin?
Dolphin is a compact CTC-Attention ASR model built for multilingual coverage of Asian and Middle Eastern languages. It was developed with a specific goal: to close the long-standing accuracy gap between Western and Eastern language ASR, where Whisper and similar English-first models tend to underperform.
The model pairs an E-Branchformer encoder (an enhanced Branchformer that, like the popular Conformer, combines self-attention with convolution-based local modeling, but runs them in parallel branches) with a Transformer decoder. EveryScribe uses its CTC branch, which enables fast, offline inference because it does not need an autoregressive decoder pass for every token.
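To make the "no autoregressive pass" point concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It is not EveryScribe's or Dolphin's actual code: the toy vocabulary, the scores, and the choice of index 0 as the blank symbol are all assumptions for illustration. The point is that each frame is scored independently, so the whole transcript falls out of one pass over the encoder output.

```python
from itertools import groupby

BLANK_ID = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, blank_id=BLANK_ID):
    """Collapse per-frame scores into a sequence of token ids.

    frame_scores: list of frames, each a list with one score per vocab entry.
    """
    # 1. Pick the best-scoring symbol independently for every frame.
    best_per_frame = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # 2. Merge runs of identical symbols, then 3. drop the blanks.
    return [sym for sym, _ in groupby(best_per_frame) if sym != blank_id]


# Toy vocabulary and scores, invented for illustration only.
vocab = ["<blank>", "ni", "hao"]
frame_scores = [
    [0.10, 0.80, 0.10],  # "ni"
    [0.20, 0.70, 0.10],  # "ni" again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # "hao"
    [0.60, 0.10, 0.30],  # blank
]
token_ids = ctc_greedy_decode(frame_scores)
text = " ".join(vocab[i] for i in token_ids)
print(text)  # ni hao
```

An attention decoder, by contrast, has to generate token N before it can start on token N+1, which is where most of the extra latency comes from.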
40 Languages, 22 Chinese Dialects
Dolphin's language coverage is exceptional for its size:
- East Asia: Mandarin, Cantonese, Japanese, Korean, and more
- South Asia: Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati, Marathi, and others
- Southeast Asia: Vietnamese, Thai, Indonesian, Malay, Tagalog, Burmese, Khmer, and more
- Middle East: Arabic, Persian (Farsi), Turkish, and Hebrew
Beyond standard languages, Dolphin supports 22 Chinese dialects including Shanghainese, Hokkien, Teochew, Hakka, Cantonese variants, and regional accents across mainland China and Taiwan. This is rare. Most ASR models treat "Chinese" as a monolith. Dolphin doesn't.
The model uses a two-level language token system: the first token identifies the language (e.g., <zh>, <ja>), and the second token specifies the region or dialect (e.g., <CN>, <TW>, <SH> for Shanghainese). This scheme lets it cleanly disambiguate between closely related language varieties.
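As a rough illustration of the two-level scheme, the sketch below builds and parses a language-plus-region prefix. The helper names (build_prefix, split_prefix) are hypothetical, the tags are just the examples quoted above, and the assumption that the tags appear as literal text at the start of a decoded string is ours; how a given runtime actually exposes them may differ.

```python
import re

# Matches one tag such as <zh> or <CN>. The tags used below are only the
# examples quoted in the text, not the model's full inventory.
TOKEN_RE = re.compile(r"<([A-Za-z]+)>")


def build_prefix(language, region):
    """Return the two-token prefix, e.g. ('zh', 'CN') -> '<zh><CN>'."""
    return f"<{language.lower()}><{region.upper()}>"


def split_prefix(decoded):
    """Split '<lang><REGION>rest' into (lang, region, rest).

    Raises ValueError when the string does not start with two tags.
    """
    first = TOKEN_RE.match(decoded)
    second = TOKEN_RE.match(decoded, first.end()) if first else None
    if not second:
        raise ValueError(f"expected two leading tags in {decoded!r}")
    return first.group(1), second.group(1), decoded[second.end():]


print(build_prefix("zh", "tw"))
print(split_prefix("<zh><SH>some transcript"))
```

Keeping language and region as separate tokens means the same first token can be shared by every variety of a language, with only the second token changing between, say, mainland Mandarin and Shanghainese.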
Accuracy: Better Than Whisper Large-v3 for Eastern Languages
On multilingual benchmarks covering the targeted language groups, Dolphin consistently outperforms Whisper Large-v3, a much larger model at roughly 1.55 billion parameters, particularly for:
- Chinese dialects (where Whisper has minimal training data)
- South Asian languages like Hindi and Bengali
- Southeast Asian languages like Vietnamese and Thai
The comparison matters because Whisper Large-v3 is the default model most people reach for when they need multilingual ASR. Being more accurate on Eastern languages, at a fraction of the size and download cost, makes Dolphin the stronger choice for these use cases.
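If you want to run this kind of comparison on your own recordings, accuracy for languages written without spaces is commonly measured as character error rate (CER): the edit distance between a reference transcript and the model's output, divided by the reference length. The sketch below is a generic implementation, not the evaluation code behind the benchmarks mentioned above, and the sample strings are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: characters)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edits needed per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one wrong character out of four.
print(cer("你好世界", "你好视界"))  # 0.25
```

Lower is better, and a CER above 1.0 is possible when the output contains many spurious insertions.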
Beyond ASR: Built-in Language ID and VAD
Dolphin's architecture includes capabilities beyond speech recognition:
- Language Identification (LID): Detect which language is being spoken without any configuration
- Voice Activity Detection (VAD): Segment audio into speech vs. non-speech regions
- Dialect segmentation: Separate audio regions by language variant when processing mixed-dialect recordings
238 MB, Runs Entirely in Your Browser
The Dolphin variant we ship is a quantized INT8 ONNX model, compressed to 238 MB — the smallest model in our lineup. It runs via WebAssembly in your browser and never transmits your audio anywhere. Once downloaded, it works with no internet connection.
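For a feel of why INT8 shrinks the download, here is a minimal sketch of symmetric per-tensor quantization, one common scheme. We are not claiming this is the exact recipe used to produce the shipped file; the weights are invented and the function names are hypothetical. Each 32-bit float (4 bytes) is replaced by an 8-bit integer (1 byte) plus one shared scale factor, so weight storage drops to roughly a quarter.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the signed 8-bit range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Invented weights, for illustration only.
weights = [0.02, -0.5, 0.31, 1.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # at most about scale / 2
```

The cost is a small rounding error per weight, bounded by half the scale step, which is why quantized models usually lose little accuracy.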
For researchers, journalists, linguists, or anyone working with audio in minority or regional Asian languages, this kind of local, offline capability is genuinely rare.
When to Choose Dolphin
Dolphin is the right model when:
- Your audio is in a South or Southeast Asian language (Hindi, Thai, Vietnamese, Indonesian, etc.)
- You're transcribing Chinese dialects — Cantonese, Shanghainese, Hokkien, Hakka, etc.
- You need broad Asian language coverage in a single small model
- You're working with mixed-language audio across the Asian region
- You want the smallest download size in our lineup
For audio that is purely Mandarin, English, Japanese, or Korean (ZH/EN/JA/KO), SenseVoice offers slightly higher accuracy. For global language coverage including European languages, see Omnilingual.
Get Started
Head to everyscribe.com/dashboard/offline-transcriber, select Dolphin from the ASR model dropdown, and download it once. All transcription runs locally in your browser.
Dolphin is an open-source model available on Hugging Face and supported by the sherpa-onnx project on GitHub. We're grateful to its creators for releasing the model and improving access to Asian-language transcription.
