
Alibaba Introduces Qwen Model to Elevate AI Transcription Capabilities

Tuesday, Sep 9, 2025


The realm of AI speech transcription is set for a shake-up with the introduction of Qwen3-ASR-Flash, a new model from Alibaba's Qwen team.

The model, built on the Qwen3-Omni framework, was trained on tens of millions of hours of speech data. According to the team, it is not merely another entry in AI speech recognition: it maintains high accuracy even in challenging acoustic settings and with complex language use.

How does it fare against its peers? Data from trials conducted in August 2025 indicate its impressive capabilities.

In public benchmarks for standard Chinese, Qwen3-ASR-Flash recorded an error rate of just 3.97 percent, well ahead of rivals such as Gemini-2.5-Pro at 8.98 percent and GPT4o-Transcribe at 15.72 percent.

The model also showcased its skill with Chinese accents, attaining an error rate of 3.48 percent. For English, it achieved a competitive rate of 3.81 percent, surpassing Gemini’s 7.63 percent and GPT4o’s 8.45 percent comfortably.

However, its most astonishing performance is observed in transcribing music.

When it came to recognizing lyrics, Qwen3-ASR-Flash recorded only a 4.51 percent error rate, surpassing its competitors significantly. This proficiency extends to internal tests on full songs, where it achieved a 9.96 percent error rate, a dramatic enhancement over Gemini-2.5-Pro's 32.79 percent and GPT4o-Transcribe’s 58.59 percent.
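The figures above are error rates in the standard ASR sense: the edit distance between the model's transcript and a human reference transcript, divided by the length of the reference (word error rate for languages like English; for Chinese, typically the analogous character error rate). A minimal sketch of how the word-level metric is computed:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word out of a six-word reference: 1/6 ≈ 16.7 percent.
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # → 0.167
```

So a 4.51 percent error rate means roughly one error for every 22 reference words, which is why the gap to a 32.79 or 58.59 percent rate is so dramatic.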

In addition to its remarkable accuracy, the model introduces groundbreaking features for next-gen AI transcription tools, notably its adaptable contextual biasing.

Say goodbye to laboriously formatted keyword lists; this system allows users to input context in virtually any format, yielding bespoke results. Whether it’s a simple keyword list, complete documents, or a chaotic combination of both, the model adapts.

This advancement removes the need for complex contextual information preprocessing. The model is proficient in leveraging context for heightened accuracy, yet its general performance remains largely unhindered even if irrelevant text is provided.
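In practice, the workflow the article describes amounts to sending the audio together with one unstructured blob of context text. The sketch below is purely illustrative: the payload shape, field names, and model identifier are assumptions, not Alibaba's published API, so consult the official Qwen documentation for the real interface.

```python
import json

def build_transcription_request(audio_url: str, context: str = "") -> str:
    """Build a hypothetical transcription request: the audio plus free-form
    biasing context (a keyword list, a whole document, or any mix of both).
    All field names here are invented for illustration."""
    payload = {
        "model": "qwen3-asr-flash",  # model name as reported in the article
        "audio": audio_url,
        # Context goes in as a single unstructured string -- no special
        # keyword formatting or preprocessing, per the article's description.
        "context": context,
    }
    return json.dumps(payload, ensure_ascii=False)

req = build_transcription_request(
    "https://example.com/meeting.wav",
    context="Attendees: Dr. Nguyen, Xiaoyu Li. Product names: QwenPad, FlashSync.",
)
```

The design point is that the caller never has to decide what counts as a "keyword": the model extracts whatever is useful from the context and, as noted above, tolerates irrelevant text without degrading.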

Alibaba envisions this AI model as a global tool for speech transcription. Capable of delivering accurate transcriptions from a single model across 11 languages, it also accommodates numerous dialects and accents.

Its support for Chinese is particularly comprehensive, including Mandarin alongside major dialects like Cantonese, Sichuanese, Minnan (Hokkien), and Wu.

For English speakers, it adeptly handles British, American, and other regional accents. Other supported languages include French, German, Spanish, Italian, Portuguese, Russian, Japanese, Korean, and Arabic.

Additionally, the model can automatically identify which of the 11 supported languages is being spoken and filter out non-speech elements such as silence and background noise, ensuring a cleaner output than previous AI speech transcription tools.
