Alibaba Qwen QwQ-32B: A Demonstration of Scaled Reinforcement Learning

Friday, Mar 7, 2025

The Qwen team at Alibaba has unveiled QwQ-32B, a 32-billion-parameter AI model that delivers results competitive with the far larger DeepSeek-R1. The achievement underscores the impact of scaling Reinforcement Learning (RL) on strong foundation models.

The Qwen team has also integrated agent capabilities into the reasoning model, enabling it to think critically, use tools effectively, and adapt its reasoning based on environmental feedback.
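To make that loop concrete, here is a minimal sketch of how such an agent cycle is commonly structured; the `generate` callable, the tool registry, and the transcript format are hypothetical stand-ins, since the article does not describe Qwen's actual agent interface.

```python
# Conceptual sketch of a tool-use loop: the model proposes an action,
# the environment executes it, and the observation is fed back so the
# model can adapt its reasoning. All names here are hypothetical.

def agent_loop(generate, tools: dict, task: str, max_turns: int = 8) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_turns):
        step = generate("\n".join(transcript))   # model's next step (a dict)
        if step.get("tool") is None:             # no tool call => final answer
            return step["answer"]
        observation = tools[step["tool"]](**step["args"])  # run the tool
        transcript.append(f"Observation from {step['tool']}: {observation}")
    return "Stopped: turn limit reached."
```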

According to the team, scaling RL can boost model performance beyond conventional pretraining and post-training techniques; recent research has shown that RL can markedly improve models' reasoning abilities.

QwQ-32B achieves results comparable to DeepSeek-R1, which has 671 billion parameters (37 billion activated per token), evidence of RL's effectiveness when applied to strong foundation models pretrained on broad world knowledge. It also demonstrates RL's potential to narrow the gap between model size and performance.

The model was evaluated across a range of benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, which together test mathematical reasoning, coding, and general problem-solving.

The results highlight QwQ-32B's capabilities relative to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.

Benchmark results:

The Qwen team's approach started from a base checkpoint and applied a multi-stage RL process driven by outcome-based rewards. The first stage focused on RL for math and coding tasks, scoring outputs with accuracy verifiers and code-execution servers. The second stage broadened to general capabilities, incorporating rewards from a general reward model and rule-based verifiers.
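As a rough illustration of what such outcome-based rewards might look like, here is a hedged sketch; the function names, answer-extraction logic, and blend weight are assumptions, as the article does not disclose the team's implementation.

```python
# Hypothetical sketches of outcome-based rewards for a multi-stage RL
# recipe. None of these names come from the article; they only
# illustrate the shape of verifier-based vs. reward-model-based scoring.

def math_accuracy_reward(completion: str, reference_answer: str) -> float:
    """Stage 1, math: an accuracy verifier scores only the final outcome."""
    lines = completion.strip().splitlines()
    return 1.0 if lines and reference_answer in lines[-1] else 0.0

def code_execution_reward(completion: str, tests: list[tuple[str, str]],
                          run_sandboxed) -> float:
    """Stage 1, coding: a code-execution server runs the generated program
    against test cases; `run_sandboxed(code, stdin) -> stdout` is injected."""
    passed = sum(run_sandboxed(completion, stdin) == expected
                 for stdin, expected in tests)
    return passed / max(len(tests), 1)

def blended_general_reward(rm_score: float, rule_ok: bool,
                           w_rm: float = 0.7) -> float:
    """Stage 2: combine a general reward-model score in [0, 1] with a
    rule-based verifier; the 0.7/0.3 blend weight is an assumption."""
    return w_rm * rm_score + (1.0 - w_rm) * (1.0 if rule_ok else 0.0)
```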

"We observe that this stage of RL training with a small number of steps can raise performance on other general capabilities, such as instruction following, alignment with human preferences, and agent performance, without significant drops in math and coding performance," the team said.

QwQ-32B is available as an open-weight model on Hugging Face and ModelScope under the Apache 2.0 license, and is also accessible through Qwen Chat. The Qwen team views this as a first step in scaling RL to enhance reasoning capabilities, and plans to further explore integrating agents with RL for long-horizon reasoning.
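For readers who want to try the open weights, here is a minimal loading sketch using the Hugging Face `transformers` library; the prompt and generation settings are illustrative assumptions rather than official recommendations.

```python
# Minimal sketch of loading the open-weight QwQ-32B release from
# Hugging Face. The prompt and max_new_tokens are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # open-weight repo on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype from the checkpoint config
    device_map="auto",    # shard across available GPUs
)

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```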

"As we progress towards building the next generation of Qwen, we are optimistic that combining stronger foundation models with RL, powered by scaled computational resources, will move us closer to achieving Artificial General Intelligence (AGI)," the team declared.
