Qwen3 Arrives: Alibaba's Hybrid-Thinking Model Family Rewrites the Open-Weight Benchmark Map

Alibaba's Qwen Team has dropped Qwen3, a family of eight models spanning 0.6B to 235B parameters, featuring a novel hybrid thinking mode that lets users toggle chain-of-thought reasoning on or off at inference time. The release lands squarely in the middle of China's most competitive open-weight moment yet.

Wei Lian · 🇨🇳 China Desk Lead · Jul 2, 2026 · 6m read

Qwen3 Is Here — and It Changes the Calculus for Open-Weight AI in China

On 28 April 2025, Alibaba's Qwen Team published the Qwen3 model family to Hugging Face and ModelScope simultaneously, alongside a detailed technical blog post in both Chinese and English. The release comprises eight models, six dense and two Mixture-of-Experts (MoE) variants, ranging from 0.6B to 235B parameters. The flagship, Qwen3-235B-A22B, is an MoE architecture that activates 22 billion parameters per forward pass. Every model in the family is released under the Apache 2.0 licence, which permits commercial use with no application required.

This is not an incremental update. Qwen3 introduces what the team calls "hybrid thinking mode" — a single model that can operate in either a slow, chain-of-thought reasoning mode ("thinking mode") or a fast, direct-response mode ("non-thinking mode"), switchable at inference time via a chat template flag. That architectural choice has significant downstream implications for deployment cost and user experience that Western coverage has largely glossed over.

---

The Model Lineup in Full

Understanding the release requires holding the full matrix in your head:

Qwen3-0.6B: Dense, 0.6B parameters. Targets on-device and edge inference.
Qwen3-1.7B: Dense, 1.7B parameters. Embedded and mobile use cases.
Qwen3-4B: Dense, 4B parameters. Reported by Alibaba's internal evals to match or exceed Qwen2.5-72B on several reasoning tasks.
Qwen3-8B: Dense, 8B parameters. The workhorse mid-range model.
Qwen3-14B: Dense, 14B parameters.
Qwen3-32B: Dense, 32B parameters. Reported as competitive with the far larger DeepSeek-R1 on AIME 2024.
Qwen3-30B-A3B: MoE, 30B total / 3B active. Extremely efficient inference profile.
Qwen3-235B-A22B: MoE, 235B total / 22B active. Flagship.

The 8B and larger models support a 128K-token context window (the 0.6B, 1.7B, and 4B models support 32K), and the whole family is multilingual across 119 languages and dialects. The team attributes that figure to a training corpus that explicitly targets low-resource language coverage, an area where most Western frontier labs still underinvest.

---

Hybrid Thinking: The Architectural Bet That Matters

The most technically interesting decision in Qwen3 is not the parameter count — it is the hybrid thinking architecture. Previous reasoning-specialised models from Chinese labs (DeepSeek-R1, QwQ-32B) were trained exclusively for chain-of-thought output. They are slow and expensive at inference time by design. Qwen3 trains a single model to do both.

"We enable Qwen3 models to seamlessly switch between 'thinking mode', for complex logical reasoning, math, and coding tasks, and 'non-thinking mode', for efficient, general-purpose dialogue. Users can control this via the enable_thinking flag in the chat template."
Source: Qwen Team technical blog, 28 April 2025

In practice, this means a developer deploying Qwen3-8B can route simple customer queries to non-thinking mode (low latency, low cost) and route complex coding or maths tasks to thinking mode — all within a single loaded model. No separate model serving infrastructure, no routing layer to a specialist model. For startups and independent developers running on constrained GPU budgets, this is a material operational advantage.
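The routing idea above can be sketched in a few lines. This is a minimal illustration, not Alibaba's implementation: the keyword heuristic, the `min_length` threshold, and the helper names `wants_thinking` and `chat_template_kwargs` are all hypothetical. The only name taken from the Qwen documentation is the `enable_thinking` flag.

```python
import re

# Hypothetical keyword heuristic for spotting reasoning-heavy prompts.
# A production router would more likely use a small classifier.
REASONING_HINTS = re.compile(
    r"\b(prove|derive|debug|refactor|algorithm|complexity|integral|equation)\b",
    re.IGNORECASE,
)


def wants_thinking(prompt: str, min_length: int = 400) -> bool:
    """Return True when the prompt looks like a reasoning-heavy task."""
    return bool(REASONING_HINTS.search(prompt)) or len(prompt) >= min_length


def chat_template_kwargs(prompt: str) -> dict:
    """Build keyword arguments intended for a chat-template call."""
    return {
        "tokenize": False,
        "add_generation_prompt": True,
        # The flag named in the Qwen Team blog post.
        "enable_thinking": wants_thinking(prompt),
    }


if __name__ == "__main__":
    for p in (
        "What are your opening hours?",
        "Debug this function and derive its complexity.",
    ):
        print(p, "->", chat_template_kwargs(p)["enable_thinking"])
```

With the Hugging Face `transformers` tokenizer for a Qwen3 checkpoint, the resulting dictionary would be passed along the lines of `tokenizer.apply_chat_template(messages, **chat_template_kwargs(prompt))`. Both modes are served by the same loaded weights, so the router changes only the prompt template, not the serving stack.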

The training pipeline to achieve this is described in stages in the technical report. Pretraining covers roughly 36 trillion tokens and ends with a long-context phase. Post-training then runs in four stages: (1) a long chain-of-thought cold start via supervised fine-tuning, (2) reasoning-focused reinforcement learning, (3) thinking mode fusion, and (4) general reinforcement learning across a broad range of tasks. Stage 3 is the novel contribution: the team describes fine-tuning the reasoning model on a mix of long chain-of-thought data and ordinary instruction-tuning data, so that a single set of weights serves both modes without catastrophic forgetting.

---

Benchmark Position: Reading the Numbers Carefully

Alibaba's own benchmark table shows Qwen3-235B-A22B scoring:

85.7 on AIME 2024 (mathematics olympiad problems)
59.1 on LiveCodeBench (competitive programming)
79.0 on GPQA Diamond (graduate-level science)

For context, OpenAI's o3-mini scores approximately 79.6 on AIME 2024 under comparable pass@1 settings, and DeepSeek-R1 (671B total, 37B active) scores 79.8. Qwen3-235B-A22B, with far fewer active parameters, appears to match or exceed both on this specific benchmark.

The more striking number is at the smaller end: Qwen3-4B reportedly matches Qwen2.5-72B on several reasoning benchmarks. If reproducible by third parties — and that is an important caveat; independent replication is still in progress as of this writing — it represents roughly an 18x parameter efficiency gain over the course of a single model generation. That trajectory, if it holds, has serious implications for inference economics across the entire Chinese cloud market.
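The ratios behind these comparisons are simple to check. The sketch below uses only figures quoted in this article (parameter counts in billions, vendor-reported AIME 2024 scores). The variable names are illustrative, and the ratios measure parameter counts only, not measured inference cost.

```python
def ratio(larger: float, smaller: float) -> float:
    """Return how many times larger the first value is than the second."""
    return larger / smaller


# Parameter counts in billions, as quoted in the article.
QWEN25_72B = 72.0
QWEN3_4B = 4.0
DEEPSEEK_R1_ACTIVE = 37.0
QWEN3_235B_ACTIVE = 22.0

# Vendor-reported AIME 2024 scores, as quoted in the article.
AIME_QWEN3_235B = 85.7
AIME_DEEPSEEK_R1 = 79.8

dense_gain = ratio(QWEN25_72B, QWEN3_4B)
active_ratio = ratio(DEEPSEEK_R1_ACTIVE, QWEN3_235B_ACTIVE)
aime_margin = AIME_QWEN3_235B - AIME_DEEPSEEK_R1

if __name__ == "__main__":
    print(f"Qwen2.5-72B vs Qwen3-4B parameters: {dense_gain:.0f}x")
    print(f"DeepSeek-R1 vs Qwen3-235B-A22B active parameters: {active_ratio:.2f}x")
    print(f"AIME 2024 margin over DeepSeek-R1: {aime_margin:.1f} points")
```

The 18x figure is the dense-parameter ratio of 72B to 4B. On the flagship side, DeepSeek-R1 activates about 1.7 times as many parameters per token as Qwen3-235B-A22B.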

AIME 2024: Qwen3-235B-A22B 85.7 vs DeepSeek-R1 79.8 vs o3-mini ~79.6
LiveCodeBench v5: Qwen3-235B-A22B 59.1 vs Claude 3.7 Sonnet ~58.7
GPQA Diamond: Qwen3-235B-A22B 79.0 vs Gemini 2.5 Pro ~84.0 (Gemini retains a lead here)
Multilingual: 119 languages supported, versus 29 for the original Qwen2.5-72B

Readers should treat all vendor-reported benchmarks with appropriate scepticism until independent evaluations are published. The Open LLM Leaderboard and EleutherAI's eval harness are the relevant third-party references to watch over the coming weeks.

---

Domestic Context: Where Qwen3 Lands in China's AI Race

Western coverage tends to frame Chinese AI releases in terms of US-China competition. The more immediately relevant frame is intra-China competition, which is ferocious right now.

DeepSeek set the pace in January 2025 with R1 and its genuinely disruptive open-weight release. Zhipu AI followed with GLM-4 updates and its own reasoning variants. Moonshot AI (月之暗面) has been aggressive on context length with its Kimi model series. ByteDance released Doubao-1.5-pro with strong coding benchmarks in February. And Baidu continues to iterate on ERNIE.

Alibaba's response with Qwen3 is notable for two strategic choices:

Refusing to fragment the model family: Rather than releasing a separate "Qwen3-Reasoner" product, the team baked reasoning into every model in the family. This is a direct counter to the ecosystem fragmentation that has made DeepSeek's lineup (V3, R1, R1-Zero, R1-Distill variants) somewhat confusing for developers to navigate.
Doubling down on open weights at scale: The 235B MoE being Apache 2.0 is a significant commitment. It signals that Alibaba views ecosystem capture — getting Qwen into downstream products, fine-tunes, and enterprise deployments — as more strategically valuable than API revenue protection at the frontier model tier.

"开源不是慈善，是生态战略。" ("Open source is not charity, it is ecosystem strategy.")
Attribution: a framing commonly attributed to Alibaba Cloud's developer relations team and widely circulated in Chinese AI developer communities on 知乎 (Zhihu) following the release.

This framing maps directly onto how Alibaba Cloud monetises the Qwen family: the weights are free, but enterprises running Qwen at scale on Alibaba Cloud's Model Studio infrastructure pay for compute, fine-tuning pipelines, and managed deployment. Open-weight is the acquisition funnel; cloud compute is the revenue line.

---

Training Data and the 36-Trillion-Token Corpus

The Qwen3 technical report states pretraining on 36 trillion tokens, up from approximately 18 trillion for Qwen2.5. The composition is not fully disclosed, but the team notes:

Increased proportion of STEM-domain text (mathematics, science, code)
Expanded multilingual coverage, with explicit curation for 119 languages
Synthetic data generated by prior Qwen models used to bootstrap reasoning-domain coverage

The use of model-generated synthetic data for reasoning training is now standard practice across both Chinese and Western frontier labs — DeepSeek-R1-Zero demonstrated that pure RL on reasoning tasks without human-annotated chain-of-thought can produce competitive results. Qwen3 appears to combine both approaches: human-curated reasoning traces for SFT stages, then RL for refinement.

The 36T token figure places Qwen3 at or above the training-data scale of Llama 3.1 (15T+ tokens) and the Mistral Large series. Token count alone does not determine training compute, however, and direct comparison is further complicated by differences in tokeniser efficiency across languages.

---

Availability and How to Run It

All Qwen3 models are available immediately:

Hugging Face: Qwen3 collection
ModelScope: Qwen3 collection
API access via Alibaba Cloud Model Studio
Ollama support is confirmed for the smaller dense models; quantised GGUF versions are already appearing on Hugging Face from community contributors

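For local inference, Qwen3-8B in Q4_K_M quantisation runs comfortably on a single RTX 4090. Qwen3-30B-A3B (MoE) is particularly interesting for local use: the sparse activation means per-token compute is closer to a 3B dense model than a 30B one. The memory footprint does not shrink in the same way, because all 30B parameters must still be resident. At roughly 4 bits per weight that is about 15 GB for the weights alone, so 16GB of VRAM is viable only with aggressive (sub-4-bit) quantisation or partial CPU offload.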
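A back-of-envelope estimate helps when sizing hardware for local inference. The sketch below computes weight memory only. It ignores the KV cache, activations, and runtime overhead, and it uses the total parameter count for MoE models, since every expert has to be held in memory unless it is offloaded. The 4.8 bits-per-weight figure is an assumed average for a Q4_K_M-style quantisation, not a measured value, and `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return total_params_billion * bits_per_weight / 8


# Assumed average bits per weight for a Q4_K_M-style quantisation.
BITS_Q4 = 4.8

# Total parameter counts in billions, as quoted in the article.
MODELS = {"Qwen3-8B": 8.0, "Qwen3-30B-A3B": 30.0}

if __name__ == "__main__":
    for name, params in MODELS.items():
        gb = weight_memory_gb(params, BITS_Q4)
        print(f"{name}: about {gb:.1f} GB of weights at {BITS_Q4} bits/weight")
```

Under these assumptions the 8B model needs under 5 GB for weights, which is why it fits easily on a 24GB card. The 30B MoE needs roughly 15 to 18 GB depending on the exact quantisation, before any cache or runtime overhead is counted.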

---

What to Watch Next

The Qwen3 release is the opening move, not the conclusion. Several things are worth tracking closely:

Third-party benchmark replication: The claim that a 4B model matches a 72B one is extraordinary and needs independent verification. Watch the Open LLM Leaderboard and community eval threads on Hugging Face.
DeepSeek's response: DeepSeek V4 / R2 has been anticipated for Q2 2025. Qwen3's strong AIME numbers add pressure on that timeline.
Enterprise adoption in China: Whether domestic Chinese enterprises — particularly in finance and manufacturing, where Alibaba Cloud has strong existing relationships — migrate workloads to Qwen3 is the real commercial test.
Fine-tune ecosystem: Apache 2.0 means we should expect a wave of domain-specific fine-tunes within weeks. Medical, legal, and financial variants will be the early indicators of ecosystem health.

Qwen3 is the most complete open-weight model family released by any Chinese lab to date, and arguably competitive with the global open-weight frontier. The hybrid thinking architecture is a genuine innovation worth the attention of anyone building LLM-powered applications in 2025.


Links & Resources

Qwen3 Official Blog Post, Qwen Team (qwenlm.github.io)
Qwen3 Hugging Face Collection (huggingface.co)
Qwen3 ModelScope Collection (modelscope.cn)
Open LLM Leaderboard, Hugging Face (huggingface.co)
Alibaba Cloud Model Studio (aliyun.com)
GLM-4, Zhipu AI on Hugging Face (huggingface.co)

