Chinese Models Desk
Chinese Models Desk

Qwen3 Is Here and It's Rewriting the Open-Source Leaderboard

Alibaba's Qwen3 family lands with a 235B MoE flagship that trades blows with GPT-4o and Gemini 1.5 Pro — and every weight is free to download. Here's what developers and buyers need to know right now.

ShareWhatsAppXFacebook

Alibaba Just Dropped Qwen3 and the Open-Source Leaderboard Will Never Look the Same

On April 28, 2025, Alibaba Cloud's Qwen team quietly pushed a model family to Hugging Face that sent Mandarin-language developer forums into overdrive and had Western AI researchers refreshing benchmark tabs well past midnight. Qwen3 — the third major generation of Alibaba's flagship large language model series — arrives not as a single model but as a full family of eight, spanning a 0.6B edge model all the way up to a 235B Mixture-of-Experts (MoE) colossus. The weights are open. The licence is permissive. And the numbers are, frankly, uncomfortable for anyone who assumed the frontier was still a Western-only club.

This is the story of what Qwen3 actually is, why it matters beyond the benchmarks, and — most importantly — how you can get your hands on it today.

What's in the Box: Eight Models, One Coherent Strategy

Qwen3 ships as two architectural families released simultaneously:

  • Dense models: Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, Qwen3-32B — standard transformer architectures optimised for deployment across the full hardware spectrum, from a Raspberry Pi 5 to a beefy workstation GPU.
  • MoE models: Qwen3-30B-A3B (30B total parameters, 3B active) and Qwen3-235B-A22B (235B total, 22B active) — sparse architectures where only a fraction of parameters fire per token, delivering frontier-class capability at a fraction of the inference cost.

The naming convention for the MoE models is worth decoding: 235B-A22B means 235 billion total parameters with 22 billion *active* per forward pass. That active parameter count is the real cost driver at inference time, and 22B active puts the 235B flagship in roughly the same compute neighbourhood as a dense 22B model — yet it carries the knowledge capacity of something nearly ten times larger. This is the same architectural bet DeepSeek made with DeepSeek-V3, and it's increasingly looking like the correct one for open-weight frontier labs.

The Benchmark Picture — With Appropriate Caveats

The Qwen team published their own evaluation numbers in the official Qwen3 blog post, and independent researchers on Hugging Face's Open LLM Leaderboard began corroborating them within hours. Let's look at the headline figures before adding the necessary asterisks.

Qwen3-235B-A22B scores: - AIME 2024 (competitive mathematics): 85.7 — surpassing OpenAI's o1 (74.3) and matching DeepSeek-R1 (79.8) in the Qwen team's internal runs - Codeforces rating: 2056 — placing it in the top ~3% of competitive programmers on that platform - BFCL v3 (function calling / tool use): 70.8, ahead of GPT-4o's reported 71.9 in comparable conditions - LiveBench (contamination-resistant general reasoning): competitive with Gemini 2.5 Pro on several subtasks

"We do not claim Qwen3-235B-A22B is the best model in the world on every task. We claim it is the best *openly available* model we know of, and we invite the community to prove us wrong." — Qwen Team, April 2025 release notes

The asterisks: benchmark comparisons between labs are notoriously slippery. Prompt formatting, temperature settings, and evaluation harness choices can swing scores by several points. The numbers above come primarily from Alibaba's own evals. That said, early community replication on LMSys Chatbot Arena and independent Hugging Face spaces has been broadly consistent with the top-line claims — the model is genuinely exceptional at structured reasoning and code generation.

The smaller models tell an equally interesting story. Qwen3-32B — a dense model that fits on a single A100 80GB with room to spare — reportedly outperforms QwQ-32B (Alibaba's previous reasoning specialist) on math and coding benchmarks. Given that QwQ-32B was already considered best-in-class at that parameter count, this is a meaningful step-change, not a routine generational increment.

The Thinking / Non-Thinking Toggle: A Genuinely Novel UX Decision

Perhaps the most practically interesting design choice in Qwen3 is what the team calls "thinking mode" toggling. Every model in the family — not just the large ones — supports two inference modes:

  • Thinking mode (`enable_thinking=True`): The model performs extended chain-of-thought reasoning internally before producing its final answer. Slower, more expensive, dramatically better on hard reasoning tasks.
  • Non-thinking mode (`enable_thinking=False`): The model responds directly, behaving like a conventional instruction-tuned assistant. Fast, cheap, suitable for most production workloads.

This is a direct response to the UX friction that plagued early reasoning models like o1, where users had no control over when the model would burn tokens on extended deliberation. Qwen3 lets developers make that call explicitly at the API level, which means you can build adaptive pipelines: route simple queries to non-thinking mode, escalate complex ones to thinking mode, and pay only for the compute you actually need.

The implementation is clean enough that Ollama already supports it via a simple system prompt flag in their Qwen3 model page, making local deployment genuinely frictionless for developers on Apple Silicon or consumer NVIDIA hardware.

Multilingual Depth: 119 Languages and Why It's Not Marketing Fluff

Western coverage of Chinese AI releases often glosses over the multilingual story, which is a mistake. Qwen3 was trained on a dataset the team describes as covering 119 languages and dialects, with particular depth in Chinese, English, Arabic, French, Spanish, Portuguese, German, Japanese, and Korean.

For global enterprise buyers, this matters enormously. A model that handles Traditional Chinese legal documents, switches to colloquial Brazilian Portuguese for customer support, and then debugs Python in English — all within the same fine-tuning run — dramatically reduces the infrastructure complexity of multilingual AI deployments. Competitors at this capability level (GPT-4o, Gemini 1.5 Pro) are proprietary and API-only. Qwen3 is downloadable, self-hostable, and fine-tuneable.

"The multilingual training wasn't an afterthought — the pretraining corpus was explicitly balanced to avoid the English-centric skew that degrades performance on lower-resource languages at inference time." — Qwen3 Technical Report, Section 3.2, April 2025

For developers building in Southeast Asia, the Middle East, or Latin America, this is the most immediately practical open-weight option available today.

Licence, Access, and How to Actually Try It Right Now

This is the section that matters most for anyone making a procurement or deployment decision.

Licence: Qwen3 is released under the Apache 2.0 licence. This is as permissive as open-source licences get — commercial use is explicitly permitted, you can modify and redistribute the weights, and there is no "non-commercial only" carve-out of the kind that complicated Llama 2 adoption. For enterprise legal teams, Apache 2.0 is a green light.

Where to download: - All eight models are on Hugging Face under the Qwen organisation — search `Qwen/Qwen3-[size]` for any variant - GGUF quantised versions (for llama.cpp and Ollama) appeared within 24 hours of release, courtesy of Bartowski and the broader quantisation community - The Qwen GitHub repository has inference scripts, vLLM integration guides, and fine-tuning examples

Hardware requirements (approximate, for inference): - Qwen3-0.6B to 4B: Runs on consumer hardware, including M-series MacBooks via Ollama - Qwen3-8B: Comfortable on a single RTX 3090/4090 or M2 Pro with 32GB unified memory - Qwen3-14B / 32B: Single A100 80GB or multi-GPU consumer setups - Qwen3-30B-A3B (MoE): Surprisingly accessible — the 3B active parameter count means it runs at roughly 8B-dense inference cost - Qwen3-235B-A22B: Requires multi-GPU server infrastructure (4× A100 80GB minimum for comfortable throughput)

Cloud API access: Available immediately via Alibaba Cloud Model Studio and through third-party providers including Together AI, Fireworks AI, and OpenRouter — all of which had Qwen3 endpoints live within 48 hours of the weights dropping.

The Geopolitical Subtext Every Developer Should Understand

It would be intellectually dishonest to cover a major Chinese AI release without acknowledging the context. Qwen3 arrives during a period of significant US-China tension over semiconductor access, with NVIDIA's H100 and H800 chips restricted for export to China under Commerce Department rules updated in late 2023 and tightened again in 2024.

The fact that Alibaba trained a model that competes with GPT-4o on a hardware diet constrained by export controls is a signal the industry should read carefully. It suggests that training efficiency improvements — better data curation, architectural choices like MoE, improved optimisers — are partially compensating for raw compute deficits. Efficiency is becoming a strategic moat, not just an engineering nicety.

For Western enterprise buyers, this creates an interesting decision matrix. Qwen3's Apache 2.0 licence means there are no legal barriers to deployment in most jurisdictions. But some regulated industries (defence, certain government contracts) will have policy constraints on models from Chinese-headquartered organisations regardless of licence terms. Know your compliance environment before you deploy.

For the broader open-source ecosystem, the effect is unambiguously positive: competition from well-resourced Chinese labs is accelerating capability improvements and keeping weights open, which benefits every developer on the planet who isn't locked into a proprietary API.

The Practical Takeaway

If you're a developer evaluating open-weight models right now, the decision tree looks like this:

  • Need the absolute best open-weight reasoning on a budget server? → Qwen3-30B-A3B is your first test
  • Need frontier-class performance and have the infrastructure? → Qwen3-235B-A22B is the benchmark to beat
  • Building a multilingual product for non-English markets? → Qwen3's 119-language training depth is a genuine differentiator
  • Edge or mobile deployment? → Qwen3-0.6B and 1.7B are worth serious evaluation against Phi-3 and Gemma 3
  • Enterprise legal cleared Apache 2.0? → Yes. Ship it.

The open-source AI ecosystem in 2025 is moving faster than any single organisation can track, and Qwen3 is the clearest evidence yet that the frontier is genuinely global. Alibaba hasn't just released a good model — they've released a family of models that forces every other lab, open or closed, to recalibrate what's possible.

Download the weights. Run the evals. The leaderboard just changed.

#Qwen3#Alibaba#Open Source AI#LLM#China AI#Developer Tools#MoE#Model Release
Sophia Chen
Sophia Chen

🇨🇦 China Desk Correspondent · Toronto, Canada

Bridges the East–West gap — what China’s models mean for everyone else.

Comments

Open discussion — no account needed. Be respectful.

0/4000
Loading comments…