Tencent Enhances Evaluation of Creative AI Models with Innovative Benchmark
Thursday, Jul 10, 2025

Tencent has unveiled a new benchmark, ArtifactsBench, designed to fix long-standing gaps in how creative AI models are evaluated.
Have you ever tasked an AI with creating something like a basic webpage or chart, only to find the result functional yet lacking in user-friendliness? This is a familiar issue, highlighting a significant hurdle in AI development: teaching machines to have a refined sensibility.
Traditionally, AI models have been tested on their capability to produce correct code. While these checks ensure the code runs, they often overlook the visual and interactive components critical to modern user experience.
ArtifactsBench is built to address this precise problem, functioning less like a compiler check and more like an automated art critic for AI-generated code.
Thrilled to introduce #ArtifactsBench! Our new benchmark is closing the gap in evaluating code generation for visual and interactive elements.
Using an innovative automated, multimodal pipeline, it assesses LLMs on 1,825 diverse tasks. The MLLM-as-Judge rates visual output with an impressive 94.4% ranking consistency against human judgments.
So, how does Tencent’s AI benchmark work? It begins by assigning an AI a creative task from a collection of over 1,800 scenarios, encompassing everything from data visualization and web applications to interactive mini-games.
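The article doesn't publish the benchmark's internal data format, but conceptually each scenario pairs a creative prompt with criteria for judging the result. A minimal sketch, with hypothetical field names rather than the actual ArtifactsBench schema, might look like this:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a single benchmark task; the field names are
# illustrative, not the real ArtifactsBench schema.
@dataclass
class ArtifactTask:
    task_id: str          # unique identifier for the scenario
    category: str         # e.g. "data visualization", "web app", "mini-game"
    prompt: str           # the creative instruction given to the model
    checklist: List[str]  # per-task criteria later used by the MLLM judge

example = ArtifactTask(
    task_id="vis-0042",
    category="data visualization",
    prompt="Build an interactive bar chart of monthly sales with hover tooltips.",
    checklist=[
        "Chart renders without errors",
        "Bars are labeled and legible",
        "Hovering a bar shows a tooltip with the exact value",
    ],
)
```

The per-task checklist matters later, when the multimodal judge scores the output.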
Once the model has generated its code, ArtifactsBench begins its analysis. It compiles and executes the code in a secure, sandboxed environment.
It then observes how the application behaves, capturing a series of screenshots over time to assess animations, state changes after user interaction, and other forms of dynamic feedback.
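As a rough illustration of this step, here is a minimal sketch that assumes the generated artifact is a self-contained HTML page and uses Playwright for headless rendering and timed screenshots; the article does not say which tooling ArtifactsBench actually uses.

```python
from playwright.sync_api import sync_playwright

def capture_states(html_path: str, out_prefix: str, delays_ms=(0, 1000, 3000)):
    """Load the generated page and screenshot it at several points in time,
    so animations and post-interaction states can be judged later.
    `html_path` should be an absolute path to the generated artifact."""
    shots = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(f"file://{html_path}")
        for i, delay in enumerate(delays_ms):
            page.wait_for_timeout(delay)  # let animations progress
            path = f"{out_prefix}_{i}.png"
            page.screenshot(path=path, full_page=True)
            shots.append(path)
        # A real pipeline would also exercise interactions before a final
        # capture, e.g. page.click("button") with a task-specific selector.
        browser.close()
    return shots
```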
The gathered data – encompassing the initial task, the AI-generated code, and the screenshots – is subsequently evaluated by a Multimodal LLM (MLLM) serving as an adjudicator.
The MLLM judge doesn't deliver a single gut-feel verdict; it works through a detailed, task-specific checklist, scoring the output against criteria that include functionality, user experience, and aesthetic appeal, which keeps the evaluation fair and thorough.
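A sketch of that judging step is below: it bundles the task, the generated code, and the captured screenshots into a checklist-driven prompt and asks a multimodal model for structured scores. The prompt wording and JSON keys are illustrative, and `call_mllm` is a hypothetical hook standing in for whichever MLLM API the real pipeline uses.

```python
import base64
import json

def build_judge_prompt(task_prompt: str, code: str, checklist: list) -> str:
    # Illustrative rubric wording, not the benchmark's actual prompt.
    criteria = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are evaluating an AI-generated interactive artifact.\n\n"
        f"Original task:\n{task_prompt}\n\n"
        f"Generated code:\n{code}\n\n"
        "Screenshots of the running artifact are attached in order.\n"
        "Score each dimension from 0 to 10 and reply with JSON containing\n"
        "'functionality', 'user_experience', and 'aesthetics', plus a\n"
        "pass/fail verdict for every checklist item below:\n"
        f"{criteria}\n"
    )

def encode_screenshots(paths: list) -> list:
    # Base64-encode the captured frames so they can travel in an API request.
    return [base64.b64encode(open(p, "rb").read()).decode() for p in paths]

def judge_artifact(task_prompt, code, checklist, screenshot_paths, call_mllm):
    """`call_mllm(prompt, images)` is a hypothetical callable for the
    multimodal judge; it is expected to return a JSON string of scores."""
    prompt = build_judge_prompt(task_prompt, code, checklist)
    images = encode_screenshots(screenshot_paths)
    return json.loads(call_mllm(prompt, images))  # e.g. {"functionality": 8, ...}
```

Asking the judge for structured JSON rather than free-form commentary is what makes scores easy to aggregate across 1,825 tasks.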
One pertinent question arises: does this automated judge actually have refined taste? The findings indicate it does.
ArtifactsBench's rankings aligned with those of WebDev Arena, the leading platform where humans vote on AI-generated creations, showing an impressive 94.4% agreement. That is a substantial improvement over older automated benchmarks, which managed only around 69.4% consistency.
Moreover, the ArtifactsBench results had over 90% concurrence with professional human developers’ judgments.
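The article doesn't spell out how the agreement figure is computed, but one common way to quantify ranking agreement is the fraction of model pairs that two leaderboards order the same way, as in this illustrative sketch with made-up ranks:

```python
from itertools import combinations

def pairwise_agreement(rank_a: dict, rank_b: dict) -> float:
    """rank_a / rank_b map model name -> rank position (1 = best)."""
    models = [m for m in rank_a if m in rank_b]
    pairs = list(combinations(models, 2))
    same = sum(
        (rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y])
        for x, y in pairs
    )
    return same / len(pairs)

benchmark_ranks = {"model_a": 1, "model_b": 2, "model_c": 3}
human_ranks     = {"model_a": 1, "model_b": 3, "model_c": 2}
print(pairwise_agreement(benchmark_ranks, human_ranks))  # 0.666...
```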
When over 30 leading global AI models were tested by Tencent, intriguing insights emerged. While top models from Google and Anthropic led the scoreboard, the tests revealed something fascinating.
It's natural to assume that an AI designed specifically for coding would excel at these tasks. Yet, the findings showed otherwise. Research indicates that “generalist models' overall capabilities often outperform those of specialized models.”
The general-purpose Qwen2.5-Instruct model outperformed its specialized counterparts, Qwen2.5-Coder (tuned for code) and Qwen2.5-VL (focused on vision).
This success, the researchers suggest, comes down to the fact that creating superior visual applications requires a fusion of coding and visual understanding skills.
These skills, described as “robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics,” are precisely the kinds of holistic, almost human-like capabilities emerging in top generalist models.
Tencent aspires for ArtifactsBench to effectively gauge these attributes, thus assessing future advancements in AI’s capacity to create not only functional but also desirable user experiences.