Chinese Models Desk
Chinese Models Desk

Qwen3 Arrives: Alibaba's Hybrid-Thinking Model Family Rewrites the Open-Weight Benchmark Map

Alibaba's Qwen Team has dropped Qwen3, a family of eight models spanning 0.6B to 235B parameters, featuring a novel hybrid thinking mode that lets users toggle chain-of-thought reasoning on or off at inference time. The release lands squarely in the middle of China's most competitive open-weight moment yet.

ShareWhatsAppXFacebook

Qwen3 Is Here — and It Changes the Calculus for Open-Weight AI in China

On 28 April 2025, Alibaba's Qwen Team published the Qwen3 model family to Hugging Face and ModelScope simultaneously, alongside a detailed technical blog post in both Chinese and English. The release comprises eight models — four dense and four Mixture-of-Experts (MoE) variants — ranging from 0.6B to 235B parameters. The flagship, Qwen3-235B-A22B, is a MoE architecture that activates 22 billion parameters per forward pass. Every weight in the family is released under the Apache 2.0 licence, meaning unrestricted commercial use with no application required.

This is not an incremental update. Qwen3 introduces what the team calls "hybrid thinking mode" — a single model that can operate in either a slow, chain-of-thought reasoning mode ("thinking mode") or a fast, direct-response mode ("non-thinking mode"), switchable at inference time via a chat template flag. That architectural choice has significant downstream implications for deployment cost and user experience that Western coverage has largely glossed over.

---

The Model Lineup in Full

Understanding the release requires holding the full matrix in your head:

  • Qwen3-0.6B — Dense, 0.6B parameters. Targets on-device and edge inference.
  • Qwen3-1.7B — Dense, 1.7B parameters. Embedded and mobile use cases.
  • Qwen3-4B — Dense, 4B parameters. Matches or exceeds prior Qwen2.5-72B on several reasoning tasks per internal evals.
  • Qwen3-8B — Dense, 8B parameters. The workhorse mid-range model.
  • Qwen3-14B — Dense, 14B parameters.
  • Qwen3-32B — Dense, 32B parameters. Competitive with DeepSeek-R1 at the 32B weight class on AIME 2024.
  • Qwen3-30B-A3B — MoE, 30B total / 3B active. Extremely efficient inference profile.
  • Qwen3-235B-A22B — MoE, 235B total / 22B active. Flagship.

All models support a 128K token context window and are multilingual across 119 languages and dialects — a figure the team attributes to a training corpus that explicitly targets low-resource language coverage, an area where most Western frontier labs still underinvest.

---

Hybrid Thinking: The Architectural Bet That Matters

The most technically interesting decision in Qwen3 is not the parameter count — it is the hybrid thinking architecture. Previous reasoning-specialised models from Chinese labs (DeepSeek-R1, QwQ-32B) were trained exclusively for chain-of-thought output. They are slow and expensive at inference time by design. Qwen3 trains a single model to do both.

"We enable Qwen3 models to seamlessly switch between 'thinking mode', for complex logical reasoning, math, and coding tasks, and 'non-thinking mode', for efficient, general-purpose dialogue. Users can control this via the enable_thinking flag in the chat template." > — Qwen Team technical blog, 28 April 2025

In practice, this means a developer deploying Qwen3-8B can route simple customer queries to non-thinking mode (low latency, low cost) and route complex coding or maths tasks to thinking mode — all within a single loaded model. No separate model serving infrastructure, no routing layer to a specialist model. For startups and independent developers running on constrained GPU budgets, this is a material operational advantage.

The training pipeline to achieve this is described in four stages in the technical report: (1) long-context pretraining on 36 trillion tokens, (2) reasoning-specialised supervised fine-tuning, (3) reinforcement learning for thinking mode, and (4) a final fusion stage that merges thinking and non-thinking capabilities without catastrophic forgetting. Stage 4 is the novel contribution; the team describes it as "thinking budget" annealing, gradually reducing enforced chain-of-thought length during training until the model learns to self-regulate.

---

Benchmark Position: Reading the Numbers Carefully

Alibaba's own benchmark table shows Qwen3-235B-A22B scoring:

  • 85.7 on AIME 2024 (mathematics olympiad problems)
  • 59.1 on LiveCodeBench (competitive programming)
  • 79.0 on GPQA Diamond (graduate-level science)

For context, OpenAI's o3-mini scores approximately 79.6 on AIME 2024 under comparable pass@1 settings, and DeepSeek-R1 (671B total, 37B active) scores 79.8. Qwen3-235B-A22B, with far fewer active parameters, appears to match or exceed both on this specific benchmark.

The more striking number is at the smaller end: Qwen3-4B reportedly matches Qwen2.5-72B on several reasoning benchmarks. If reproducible by third parties — and that is an important caveat; independent replication is still in progress as of this writing — it represents roughly an 18x parameter efficiency gain over the course of a single model generation. That trajectory, if it holds, has serious implications for inference economics across the entire Chinese cloud market.

  • AIME 2024: Qwen3-235B-A22B 85.7 vs DeepSeek-R1 79.8 vs o3-mini ~79.6
  • LiveCodeBench v5: Qwen3-235B-A22B 59.1 vs Claude 3.7 Sonnet ~58.7
  • GPQA Diamond: Qwen3-235B-A22B 79.0 vs Gemini 2.5 Pro ~84.0 (Gemini retains a lead here)
  • Multilingual: 119 languages supported, versus 29 for the original Qwen2.5-72B

Readers should treat all vendor-reported benchmarks with appropriate scepticism until independent evaluations are published. The Open LLM Leaderboard and EleutherAI's eval harness are the relevant third-party references to watch over the coming weeks.

---

Domestic Context: Where Qwen3 Lands in China's AI Race

Western coverage tends to frame Chinese AI releases in terms of US-China competition. The more immediately relevant frame is intra-China competition, which is ferocious right now.

DeepSeek set the pace in January 2025 with R1 and its genuinely disruptive open-weight release. Zhipu AI followed with GLM-4 updates and its own reasoning variants. Moonshot AI (月之暗面) has been aggressive on context length with its Kimi model series. ByteDance released Doubao-1.5-pro with strong coding benchmarks in February. And Baidu continues to iterate on ERNIE.

Alibaba's response with Qwen3 is notable for two strategic choices:

  • Refusing to fragment the model family: Rather than releasing a separate "Qwen3-Reasoner" product, the team baked reasoning into every model in the family. This is a direct counter to the ecosystem fragmentation that has made DeepSeek's lineup (V3, R1, R1-Zero, R1-Distill variants) somewhat confusing for developers to navigate.
  • Doubling down on open weights at scale: The 235B MoE being Apache 2.0 is a significant commitment. It signals that Alibaba views ecosystem capture — getting Qwen into downstream products, fine-tunes, and enterprise deployments — as more strategically valuable than API revenue protection at the frontier model tier.
"开源不是慈善,是生态战略。" ("Open source is not charity, it is ecosystem strategy.") > — Commonly attributed framing within Alibaba Cloud's developer relations team, widely circulated in Chinese AI developer communities on 知乎 (Zhihu) following the release.

This framing maps directly onto how Alibaba Cloud monetises the Qwen family: the weights are free, but enterprises running Qwen at scale on Alibaba Cloud's Model Studio infrastructure pay for compute, fine-tuning pipelines, and managed deployment. Open-weight is the acquisition funnel; cloud compute is the revenue line.

---

Training Data and the 36-Trillion-Token Corpus

The Qwen3 technical report states pretraining on 36 trillion tokens, up from approximately 18 trillion for Qwen2.5. The composition is not fully disclosed, but the team notes:

  • Increased proportion of STEM-domain text (mathematics, science, code)
  • Expanded multilingual coverage, with explicit curation for 119 languages
  • Synthetic data generated by prior Qwen models used to bootstrap reasoning-domain coverage

The use of model-generated synthetic data for reasoning training is now standard practice across both Chinese and Western frontier labs — DeepSeek-R1-Zero demonstrated that pure RL on reasoning tasks without human-annotated chain-of-thought can produce competitive results. Qwen3 appears to combine both approaches: human-curated reasoning traces for SFT stages, then RL for refinement.

The 36T token figure places Qwen3 in the same training compute tier as Llama 3.1 (15T+) and Mistral Large series, though direct comparison is complicated by differences in tokeniser efficiency across languages.

---

Availability and How to Run It

All Qwen3 models are available immediately:

For local inference, Qwen3-8B in Q4_K_M quantisation runs comfortably on a single RTX 4090. Qwen3-30B-A3B (MoE) is particularly interesting for local use: the sparse activation means the effective memory footprint at inference is closer to a 3B dense model than a 30B one, making it viable on 16GB VRAM with aggressive quantisation.

---

What to Watch Next

The Qwen3 release is the opening move, not the conclusion. Several things are worth tracking closely:

  • Third-party benchmark replication: The 4B-beats-72B claim is extraordinary and needs independent verification. Watch the Open LLM Leaderboard and community eval threads on Hugging Face.
  • DeepSeek's response: DeepSeek V4 / R2 has been anticipated for Q2 2025. Qwen3's strong AIME numbers will accelerate that timeline pressure.
  • Enterprise adoption in China: Whether domestic Chinese enterprises — particularly in finance and manufacturing, where Alibaba Cloud has strong existing relationships — migrate workloads to Qwen3 is the real commercial test.
  • Fine-tune ecosystem: Apache 2.0 means we should expect a wave of domain-specific fine-tunes within weeks. Medical, legal, and financial variants will be the early indicators of ecosystem health.

Qwen3 is the most complete open-weight model family released by any Chinese lab to date, and arguably competitive with the global open-weight frontier. The hybrid thinking architecture is a genuine innovation worth the attention of anyone building LLM-powered applications in 2025.

#Qwen3#Alibaba#open-weight#China AI#DeepSeek#MoE#reasoning#Qwen#large language models#benchmark
Wei Lian
Wei Lian

🇨🇳 China Desk Lead · Beijing, China

Reads the Mandarin sources first — DeepSeek, Qwen, Zhipu, and the rest.

Comments

Open discussion — no account needed. Be respectful.

0/4000
Loading comments…