Tencent Enhances Evaluation of Creative AI Models with Innovative Benchmark
Thursday, Jul 10, 2025

Tencent has unveiled a new benchmark, ArtifactsBench, designed to fix long-standing gaps in how creative AI models are evaluated.
Have you ever tasked an AI with creating something like a basic webpage or chart, only to find the result functional yet lacking in user-friendliness? This is a familiar issue, highlighting a significant hurdle in AI development: teaching machines to have a refined sensibility.
Traditionally, AI models have been tested on their capability to produce correct code. While these checks ensure the code runs, they often overlook the visual and interactive components critical to modern user experience.
ArtifactsBench is built to address this precise problem, functioning less like a compiler check and more like an automated art critic for AI-generated code.
Thrilled to introduce #ArtifactsBench! Our new benchmark is closing the gap in evaluating code generation for visual and interactive elements.
Using an innovative automated, multimodal pipeline, it assesses LLMs on 1,825 diverse tasks. The MLLM-as-Judge rates visual output with an impressive 94.4% ranking consistency against human judgments.
So, how does Tencent’s AI benchmark work? It begins by assigning an AI a creative task from a collection of over 1,800 scenarios, encompassing everything from data visualization and web applications to interactive mini-games.
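The article doesn't publish the benchmark's internal data format, but conceptually each scenario pairs a creative prompt with criteria for judging the result. A minimal sketch, with hypothetical field names rather than the actual ArtifactsBench schema, might look like this:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a single benchmark task; the field names are
# illustrative, not the real ArtifactsBench schema.
@dataclass
class ArtifactTask:
    task_id: str          # unique identifier for the scenario
    category: str         # e.g. "data visualization", "web app", "mini-game"
    prompt: str           # the creative instruction given to the model
    checklist: List[str]  # per-task criteria later used by the MLLM judge

example = ArtifactTask(
    task_id="vis-0042",
    category="data visualization",
    prompt="Build an interactive bar chart of monthly sales with hover tooltips.",
    checklist=[
        "Chart renders without errors",
        "Bars are labeled and legible",
        "Hovering a bar shows a tooltip with the exact value",
    ],
)
```

The per-task checklist matters later, when the multimodal judge scores the output.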
Once the model has generated its code, ArtifactsBench begins its analysis. It compiles and executes the code in a secure, sandboxed environment.
It then observes how the application behaves, capturing a series of screenshots over time to assess animations, state changes after user interaction, and other forms of dynamic feedback.
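As a rough illustration of this step, here is a minimal sketch that assumes the generated artifact is a self-contained HTML page and uses Playwright for headless rendering and timed screenshots; the article does not say which tooling ArtifactsBench actually uses.

```python
from playwright.sync_api import sync_playwright

def capture_states(html_path: str, out_prefix: str, delays_ms=(0, 1000, 3000)):
    """Load the generated page and screenshot it at several points in time,
    so animations and post-interaction states can be judged later.
    `html_path` should be an absolute path to the generated artifact."""
    shots = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(f"file://{html_path}")
        for i, delay in enumerate(delays_ms):
            page.wait_for_timeout(delay)  # let animations progress
            path = f"{out_prefix}_{i}.png"
            page.screenshot(path=path, full_page=True)
            shots.append(path)
        # A real pipeline would also exercise interactions before a final
        # capture, e.g. page.click("button") with a task-specific selector.
        browser.close()
    return shots
```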
The gathered data – encompassing the initial task, the AI-generated code, and the screenshots – is subsequently evaluated by a Multimodal LLM (MLLM) serving as an adjudicator.
The MLLM judge doesn't deliver a single gut-feel verdict; it works through a detailed, task-specific checklist, scoring the output against criteria that include functionality, user experience, and aesthetic appeal, which keeps the evaluation fair and thorough.
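A sketch of that judging step is below: it bundles the task, the generated code, and the captured screenshots into a checklist-driven prompt and asks a multimodal model for structured scores. The prompt wording and JSON keys are illustrative, and `call_mllm` is a hypothetical hook standing in for whichever MLLM API the real pipeline uses.

```python
import base64
import json

def build_judge_prompt(task_prompt: str, code: str, checklist: list) -> str:
    # Illustrative rubric wording, not the benchmark's actual prompt.
    criteria = "\n".join(f"- {item}" for item in checklist)
    return (
        "You are evaluating an AI-generated interactive artifact.\n\n"
        f"Original task:\n{task_prompt}\n\n"
        f"Generated code:\n{code}\n\n"
        "Screenshots of the running artifact are attached in order.\n"
        "Score each dimension from 0 to 10 and reply with JSON containing\n"
        "'functionality', 'user_experience', and 'aesthetics', plus a\n"
        "pass/fail verdict for every checklist item below:\n"
        f"{criteria}\n"
    )

def encode_screenshots(paths: list) -> list:
    # Base64-encode the captured frames so they can travel in an API request.
    return [base64.b64encode(open(p, "rb").read()).decode() for p in paths]

def judge_artifact(task_prompt, code, checklist, screenshot_paths, call_mllm):
    """`call_mllm(prompt, images)` is a hypothetical callable for the
    multimodal judge; it is expected to return a JSON string of scores."""
    prompt = build_judge_prompt(task_prompt, code, checklist)
    images = encode_screenshots(screenshot_paths)
    return json.loads(call_mllm(prompt, images))  # e.g. {"functionality": 8, ...}
```

Asking the judge for structured JSON rather than free-form commentary is what makes scores easy to aggregate across 1,825 tasks.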
One pertinent question arises: does this automated judge actually have refined taste? The findings indicate it does.
ArtifactsBench's rankings aligned with those of WebDev Arena, the leading platform where humans vote on AI-generated creations, showing an impressive 94.4% agreement. That is a substantial improvement over older automated benchmarks, which managed only around 69.4% consistency.
Moreover, the ArtifactsBench results had over 90% concurrence with professional human developers’ judgments.
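The article doesn't spell out how the agreement figure is computed, but one common way to quantify ranking agreement is the fraction of model pairs that two leaderboards order the same way, as in this illustrative sketch with made-up ranks:

```python
from itertools import combinations

def pairwise_agreement(rank_a: dict, rank_b: dict) -> float:
    """rank_a / rank_b map model name -> rank position (1 = best)."""
    models = [m for m in rank_a if m in rank_b]
    pairs = list(combinations(models, 2))
    same = sum(
        (rank_a[x] < rank_a[y]) == (rank_b[x] < rank_b[y])
        for x, y in pairs
    )
    return same / len(pairs)

benchmark_ranks = {"model_a": 1, "model_b": 2, "model_c": 3}
human_ranks     = {"model_a": 1, "model_b": 3, "model_c": 2}
print(pairwise_agreement(benchmark_ranks, human_ranks))  # 0.666...
```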
When over 30 leading global AI models were tested by Tencent, intriguing insights emerged. While top models from Google and Anthropic led the scoreboard, the tests revealed something fascinating.
It's natural to assume that an AI designed specifically for coding would excel at these tasks. Yet, the findings showed otherwise. Research indicates that “generalist models' overall capabilities often outperform those of specialized models.”
The general-purpose Qwen2.5-Instruct model outperformed its specialized counterparts, Qwen2.5-Coder (tuned for code) and Qwen2.5-VL (focused on vision).
This success, the researchers suggest, comes down to the fact that creating superior visual applications requires a fusion of coding and visual understanding skills.
These skills, described as “robust reasoning, nuanced instruction following, and an implicit sense of design aesthetics,” are precisely the kinds of holistic, almost human-like capabilities emerging in top generalist models.
Tencent aspires for ArtifactsBench to effectively gauge these attributes, thus assessing future advancements in AI’s capacity to create not only functional but also desirable user experiences.