FunASR Nano Is Now Available on EveryScribe — LLM-Powered ASR for Chinese, Japanese, and Dialects

May 14, 2025

FunASR Nano — the LLM-enhanced speech recognition model from Tongyi Lab — is now available in EveryScribe's Private Transcriber. Powered by a Qwen3-0.6B language model decoder, this is our most accurate model for Chinese, Japanese, and Chinese dialect transcription. It also happens to run entirely in your browser.

What Is FunASR Nano?

FunASR Nano is a hybrid ASR architecture that combines a conventional speech encoder with a large language model decoder. Specifically:

  • Audio Encoder: A 200M-parameter speech encoder that converts audio into high-level acoustic representations
  • LLM Decoder: Qwen3-0.6B (~600M parameters), one of Alibaba's latest open-source language models, which decodes those representations into text

Total: approximately 800M parameters, making it the most capable model in EveryScribe's Private Transcriber lineup.

The idea behind this design is simple: language models capture context, grammar, and semantics in ways that traditional ASR decoders do not. By pairing a strong audio encoder with an LLM decoder, FunASR Nano can resolve acoustically ambiguous inputs using linguistic context, much as a human listener fills in words they couldn't quite hear.

Language Support: Chinese Depth No Other Model Matches

FunASR Nano's standout feature is its treatment of Chinese linguistic diversity:

  • Standard Mandarin (Putonghua)
  • 7 major Chinese dialect families: Cantonese (Yue), Shanghainese (Wu), Hokkien (Min), Hakka, Gan, Xiang, Jin
  • 26 regional accents within Mandarin and dialect groups
  • Code-switching: Mixed Chinese-English speech handled natively
  • English (standalone and mixed with Chinese)
  • Japanese

For context: most ASR models that claim "Chinese support" mean standard Mandarin only. FunASR Nano genuinely handles the full diversity of Chinese speech — including dialects that represent hundreds of millions of speakers.

What the LLM Brings to ASR

Traditional ASR systems struggle with:

  • Acoustically ambiguous homophones (particularly common in Chinese, where a single syllable can correspond to many different characters)
  • Incomplete sentences and disfluencies in natural speech
  • Technical vocabulary in specialized domains
  • Far-field and noisy recording conditions

The Qwen3-0.6B decoder addresses these problems through language modeling. It can infer that a word acoustically close to "tā" probably means something different in a medical context than in a casual conversation (the syllable alone could be written 他, 她, or 它). It recognizes that "yínháng" should be written as 银行 (bank) and not as a same-sounding but meaningless string such as 因行 (literally "walk due to"). Context resolves ambiguity.
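To make the idea concrete, here is a toy sketch in Python. It is not how FunASR Nano works internally (an LLM decoder generates text directly from the encoder's output rather than rescoring a fixed list). It only illustrates the principle that a context score can break a tie the audio cannot. The score tables, their numbers, and the `pick` function are all invented for illustration.

```python
import math

# Hypothetical acoustic scores: both candidates sound like "yinhang",
# so the audio alone cannot tell them apart.
ACOUSTIC_LOGP = {"银行": math.log(0.5), "因行": math.log(0.5)}

# Toy context model: log-probability of a word given the previous word.
# These numbers are made up for illustration only.
CONTEXT_LOGP = {
    ("去", "银行"): math.log(0.2),     # "go to the bank" is plausible
    ("去", "因行"): math.log(0.0001),  # same sound, not a real word
}

def pick(prev_word, candidates):
    """Return the candidate with the best combined acoustic + context score."""
    def score(word):
        # Unseen word pairs get a small fallback probability.
        context = CONTEXT_LOGP.get((prev_word, word), math.log(1e-6))
        return ACOUSTIC_LOGP[word] + context
    return max(candidates, key=score)

best = pick("去", ["银行", "因行"])
print(best)  # 银行
```

The acoustic scores are tied on purpose, so the choice is made entirely by the context term. A real decoder conditions on far more than one preceding word.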

FunASR Nano has demonstrated up to 93% accuracy in high-noise, far-field conditions — recordings made at a distance from the microphone, with background noise — where conventional models degrade significantly.

Specialized Domain Performance

The model was specifically evaluated and tuned for:

  • Education: Lectures, classroom recordings, online courses
  • Finance: Earnings calls, market analysis, financial advisory content
  • Medical: Clinical notes, consultations, medical education
  • Entertainment: Lyric recognition and rap speech recognition — unusual capabilities that reflect training data diversity

Lyric and rap recognition are particularly interesting: sung speech and rapid rhythmic speech are notoriously difficult for ASR systems. FunASR Nano handles both.

Punctuation and Text Normalization

FunASR Nano outputs properly punctuated Chinese text: full-width Chinese punctuation (,。!?《》) comes directly from the model, and numbers, dates, and amounts are rendered in their conventional written form, with digits, rather than spelled out as spoken. A sum spoken as "yīqiān liùbǎi wǔshí yuán" comes out as 1,650元, not a phonetic transliteration.
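For readers curious what this kind of normalization involves, the sketch below converts a spelled-out amount into digits in Python. FunASR Nano produces normalized text as part of decoding, so this is not the model's method. It is a minimal rule-based illustration, and `chinese_to_int` and `normalize_amount` are hypothetical helper names that only cover whole numbers below 10,000.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def chinese_to_int(text):
    """Convert a spelled-out Chinese number below 10,000 to an int."""
    total = 0
    current = 0
    for ch in text:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit such as "十" stands for 1 x 10.
            total += (current or 1) * UNITS[ch]
            current = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + current

def normalize_amount(spoken):
    """Render a spoken yuan amount with digits and a thousands separator."""
    number_part = spoken[:-1] if spoken.endswith("元") else spoken
    return f"{chinese_to_int(number_part):,}元"

print(normalize_amount("一千六百五十元"))  # 1,650元
```

The input 一千六百五十元 is the character form of "yīqiān liùbǎi wǔshí yuán" from the example above.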

980 MB, Runs in Your Browser

FunASR Nano is our heaviest model at 980 MB, a consequence of its roughly 800M-parameter architecture. The extra download cost is worth it for users who need the highest-accuracy Chinese, Japanese, or dialect transcription available in a browser.
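A quick back-of-the-envelope check in Python relates the download size to the parameter counts quoted earlier. It assumes 1 MB = 10^6 bytes and that the download is dominated by model weights. Neither assumption is confirmed by the model's documentation, so treat the result as a rough estimate.

```python
ENCODER_PARAMS = 200e6  # audio encoder, as quoted above
DECODER_PARAMS = 600e6  # Qwen3-0.6B decoder, approximate
DOWNLOAD_BYTES = 980e6  # 980 MB, assuming 1 MB = 10**6 bytes

total_params = ENCODER_PARAMS + DECODER_PARAMS
bytes_per_param = DOWNLOAD_BYTES / total_params

print(f"total parameters: {total_params / 1e6:.0f}M")
print(f"bytes per parameter: {bytes_per_param:.2f}")

# Size the same weights would need at common numeric precisions.
for name, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {total_params * width / 1e6:.0f} MB")
```

The download works out to roughly 1.2 bytes per parameter, much closer to 8-bit storage than to 16-bit or 32-bit. That would be consistent with quantized weights, though this is an inference from the numbers, not a documented detail.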

Like all EveryScribe Private Transcriber models, it runs via WebAssembly in your browser. Your audio stays on your device. Once downloaded, it works without an internet connection.

Note on speed: Because FunASR Nano uses an LLM decoder, it is slower than our pure-encoder models (SenseVoice, Dolphin). For real-time feedback on short audio clips it works well; for very long recordings, expect it to take longer than SenseVoice or Moonshine.

When to Choose FunASR Nano

FunASR Nano is the model for you if:

  • You need the highest possible accuracy for Chinese (Mandarin or dialects)
  • Your audio contains Chinese dialects — Cantonese, Shanghainese, Hokkien, Hakka, etc.
  • You're transcribing mixed Chinese-English (code-switching) speech
  • Your recording environment is noisy or far-field
  • You're transcribing domain-specific content in finance, medicine, or education
  • Accuracy matters more than transcription speed

For faster Chinese transcription with slightly lower accuracy, SenseVoice remains an excellent choice.

Get Started

Visit everyscribe.com/dashboard/offline-transcriber, select FunASR Nano from the ASR model dropdown, and download it once. The most advanced Chinese and dialect ASR available in a browser — running entirely on your device.


FunASR Nano is developed by Tongyi Lab (Alibaba DAMO Academy) and is available on GitHub. The Qwen3-0.6B decoder is from Qwen. We thank both teams for their commitment to open-source LLM and ASR research.

The EveryScribe Team
