Alibaba Qwen QwQ-32B: A Demonstration of Scaled Reinforcement Learning

Friday, Mar 7, 2025

The Qwen team at Alibaba has unveiled QwQ-32B, a 32-billion-parameter AI model that delivers results competitive with the far larger DeepSeek-R1. The achievement underscores the impact of scaling Reinforcement Learning (RL) on strong foundation models.

The Qwen team has also integrated agent capabilities into the reasoning model, enabling it to think critically, use tools effectively, and adapt its reasoning based on environmental feedback.
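To make that loop concrete, here is a minimal sketch of how such an agent cycle is commonly structured; the `generate` callable, the tool registry, and the transcript format are hypothetical stand-ins, since the article does not describe Qwen's actual agent interface.

```python
# Conceptual sketch of a tool-use loop: the model proposes an action,
# the environment executes it, and the observation is fed back so the
# model can adapt its reasoning. All names here are hypothetical.

def agent_loop(generate, tools: dict, task: str, max_turns: int = 8) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_turns):
        step = generate("\n".join(transcript))   # model's next step (a dict)
        if step.get("tool") is None:             # no tool call => final answer
            return step["answer"]
        observation = tools[step["tool"]](**step["args"])  # run the tool
        transcript.append(f"Observation from {step['tool']}: {observation}")
    return "Stopped: turn limit reached."
```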

According to the team, scaling RL can boost model performance beyond conventional pretraining and post-training techniques; recent research has shown that RL can markedly improve models' reasoning abilities.

QwQ-32B achieves results comparable to DeepSeek-R1, which has 671 billion parameters (37 billion activated per token), evidence of RL's effectiveness when applied to strong foundation models pretrained on broad world knowledge. It also demonstrates RL's potential to narrow the gap between model size and performance.

The model was evaluated across a range of benchmarks, including AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL, which together test mathematical reasoning, coding, and general problem-solving.

The results highlight QwQ-32B's capabilities relative to other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, o1-mini, and the original DeepSeek-R1.

Benchmark results:

The Qwen team's approach started from a base checkpoint and applied a multi-stage RL process driven by outcome-based rewards. The first stage focused on RL for math and coding tasks, scoring outputs with accuracy verifiers and code-execution servers. The second stage broadened to general capabilities, incorporating rewards from a general reward model and rule-based verifiers.
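As a rough illustration of what such outcome-based rewards might look like, here is a hedged sketch; the function names, answer-extraction logic, and blend weight are assumptions, as the article does not disclose the team's implementation.

```python
# Hypothetical sketches of outcome-based rewards for a multi-stage RL
# recipe. None of these names come from the article; they only
# illustrate the shape of verifier-based vs. reward-model-based scoring.

def math_accuracy_reward(completion: str, reference_answer: str) -> float:
    """Stage 1, math: an accuracy verifier scores only the final outcome."""
    lines = completion.strip().splitlines()
    return 1.0 if lines and reference_answer in lines[-1] else 0.0

def code_execution_reward(completion: str, tests: list[tuple[str, str]],
                          run_sandboxed) -> float:
    """Stage 1, coding: a code-execution server runs the generated program
    against test cases; `run_sandboxed(code, stdin) -> stdout` is injected."""
    passed = sum(run_sandboxed(completion, stdin) == expected
                 for stdin, expected in tests)
    return passed / max(len(tests), 1)

def blended_general_reward(rm_score: float, rule_ok: bool,
                           w_rm: float = 0.7) -> float:
    """Stage 2: combine a general reward-model score in [0, 1] with a
    rule-based verifier; the 0.7/0.3 blend weight is an assumption."""
    return w_rm * rm_score + (1.0 - w_rm) * (1.0 if rule_ok else 0.0)
```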

"We observe that this stage of RL training with a small number of steps can raise performance on other general capabilities, such as instruction following, alignment with human preferences, and agent performance, without significant drops in math and coding performance," the team said.

QwQ-32B is available as an open-weight model on Hugging Face and ModelScope under the Apache 2.0 license, and is also accessible through Qwen Chat. The Qwen team views this as a first step in scaling RL to enhance reasoning capabilities, and plans to further explore integrating agents with RL for long-horizon reasoning.
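For readers who want to try the open weights, here is a minimal loading sketch using the Hugging Face `transformers` library; the prompt and generation settings are illustrative assumptions rather than official recommendations.

```python
# Minimal sketch of loading the open-weight QwQ-32B release from
# Hugging Face. The prompt and max_new_tokens are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"  # open-weight repo on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype from the checkpoint config
    device_map="auto",    # shard across available GPUs
)

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```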

"As we progress towards building the next generation of Qwen, we are optimistic that combining stronger foundation models with RL, powered by scaled computational resources, will move us closer to achieving Artificial General Intelligence (AGI)," the team declared.
