Latest Research in AI Development

Real-World Development Capability Benchmark

Evaluating AI Agents' Performance in Production-Ready Software Engineering with Automated GUI Testing and Interactive Assessment

194 Tasks · 4 Categories · 92% Accuracy · 85% Correlation

Project Overview

Real-World Tasks

Tasks drawn from real application development needs, providing a comprehensive evaluation of practical development capability

Multi-Dimensional Evaluation

Assessing AI agent capabilities across requirement understanding, code implementation, debugging, and more

Agent-as-a-Judge Evaluation

Novel evaluation paradigm using autonomous agents for interactive software testing and assessment

Dataset & Evaluation

RealDevBench

A comprehensive real-world development task dataset containing various application development scenarios and task types. This benchmark evaluates AI agents' capabilities across multiple dimensions of software development with automated GUI testing and interactive assessment.

194 Development Tasks · 4 Task Categories · Open Source
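
To make the task structure concrete, here is a minimal sketch of how a single benchmark entry might be represented; the field names and the JSON Lines storage format are illustrative assumptions, not the released RealDevBench schema.

```python
import json
from dataclasses import dataclass, field

@dataclass
class DevTask:
    """One benchmark entry. Field names are illustrative assumptions,
    not the released RealDevBench schema."""
    task_id: str                # e.g. "display_017" (hypothetical ID format)
    category: str               # one of: display, analysis, game, data
    requirement: str            # natural-language software requirement
    assets: list[str] = field(default_factory=list)   # images/audio/data files
    criteria: list[str] = field(default_factory=list) # points the judge checks

def load_tasks(path: str) -> list[DevTask]:
    """Load tasks from a JSON Lines file (assumed storage format)."""
    with open(path, encoding="utf-8") as f:
        return [DevTask(**json.loads(line)) for line in f]
```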

Key Features

  • Real-World Scenarios: Actual development challenges
  • Multimodal Tasks: Text, images, audio, and data
  • End-to-End Evaluation: From understanding to debugging
  • Human-Aligned Assessment: 92% accuracy, 85% expert correlation

Task Type Distribution

  • Display: 50.0%
  • Analysis: 18.6%
  • Game: 17.0%
  • Data: 14.4%
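
Applied to the 194 tasks, these shares correspond to roughly 97 display, 36 analysis, 33 game, and 28 data tasks.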

Dataset Case Analysis

Detailed showcase of typical cases in the RealDevBench dataset, including task types, difficulty distribution, and evaluation criteria.

AppEvalPilot Framework

Research Methodology

Comprehensive research methodology documentation including evaluation framework design, metric definitions, and experimental setup.

Evaluation Pipeline Architecture

AppEvalPilot's three-stage evaluation pipeline: test case generation, interactive execution, and automated assessment with GUI interaction capabilities.
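
As a rough illustration of this flow, the sketch below wires the three stages together in plain Python; every function is a placeholder standing in for the corresponding AppEvalPilot component, not the framework's actual API.

```python
def generate_test_cases(requirement: str) -> list[str]:
    # Stage 1 placeholder: the real system uses an LLM to expand the
    # requirement into concrete, executable test cases.
    return [f"Verify: {line.strip()}"
            for line in requirement.splitlines() if line.strip()]

def execute_via_gui(app_url: str, case: str) -> str:
    # Stage 2 placeholder: the real system drives the running app with
    # mouse/keyboard actions and records what it observes.
    return f"observation for {case!r} at {app_url}"

def assess(case: str, observation: str) -> bool:
    # Stage 3 placeholder: the real system asks an LLM judge whether the
    # observed behaviour satisfies the test case.
    return bool(observation)

def evaluate_app(requirement: str, app_url: str) -> dict:
    """Generation -> interactive execution -> assessment."""
    cases = generate_test_cases(requirement)
    observations = [execute_via_gui(app_url, c) for c in cases]
    verdicts = [assess(c, o) for c, o in zip(cases, observations)]
    return {"pass_rate": sum(verdicts) / max(len(verdicts), 1),
            "verdicts": verdicts}
```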

Experimental Results

Performance Analysis Charts

Comparative performance analysis showing different AI agents and frameworks across multiple evaluation dimensions.

AppEvalPilot Performance Comparison

| Method | Feature-level Quality | Feature-level Align. | Test-case-level Quality | Test-case-level Align. | Acc. | Time | Cost |
|---|---|---|---|---|---|---|---|
| Human | - | - | - | - | - | 0.74 | 0.65 |
| *GUI Model* | | | | | | | |
| Claude-3.5-Sonnet | 0.27 | 0.23 | 0.46 | 0.49 | 0.68 | 9.20 | 1.01 |
| UI-Tars | 0.49 | 0.29 | 0.63 | 0.59 | 0.75 | 8.65 | 0.17 |
| *GUI Agent Framework* | | | | | | | |
| WebVoyager (Qwen2.5) | 0.29 | 0.25 | 0.35 | 0.44 | 0.60 | 2.16 | 0.04 |
| WebVoyager (Claude) | 0.64 | 0.43 | 0.60 | 0.55 | 0.74 | 1.60 | 0.10 |
| Browser-Use (Claude) | 0.67 | 0.58 | 0.63 | 0.61 | 0.76 | 13.50 | 1.13 |
| AppEvalPilot (Claude) | 0.73 | 0.85 | 0.74 | 0.81 | 0.92 | 9.00 | 0.26 |

Performance comparison on RealDevBench benchmark. AppEvalPilot achieves superior performance with 92% accuracy and 85% human correlation, demonstrating significant improvements in evaluation quality and efficiency.

Leaderboard

If you would like to submit your system or model to any of our leaderboards (Products, Open-source & LLM, or Overall), please follow the instructions provided in our submission guide.
| Rank | Agent Name | Model | Organization | Agent Quality | Code Quality | Visual Quality |
|---|---|---|---|---|---|---|
| 1 | MGX (BoN-3) | MGX Framework | MGX Team | 0.78 | 0.72 | 0.41 |
| 2 | Lovable | Lovable Framework | Lovable Team | 0.74 | 0.58 | 0.47 |
| 3 | MGX | MGX Framework | MGX Team | 0.60 | 0.68 | 0.41 |
| 4 | Bolt | Bolt Framework | StackBlitz | 0.54 | 0.69 | 0.50 |
| 5 | Qwen3-Coder-480B | Qwen3-Coder-480B | Alibaba | 0.53 | 0.41 | 0.32 |
| 6 | OpenHands | OpenHands Framework | OpenHands Team | 0.50 | 0.38 | 0.33 |
| 7 | Kimi-K2 | Kimi-K2 | Moonshot AI | 0.39 | 0.41 | 0.29 |
| 8 | Claude-3.7-Sonnet | Claude-3.7-Sonnet | Anthropic | 0.31 | 0.41 | 0.18 |
| 9 | Gemini-2.5-Pro | Gemini-2.5-Pro | Google | 0.29 | 0.45 | 0.26 |
| 10 | DeepSeek-V3 | DeepSeek-V3 | DeepSeek | 0.29 | 0.18 | 0.21 |

AppEvalPilot Demo & Case Studies

Watch our Agent-as-a-Judge evaluation system in action and explore detailed case studies

AppEvalPilot Agent-as-a-Judge Evaluation

Real-time demonstration of autonomous software evaluation with GUI interaction and automated testing

Key Capabilities

Intelligent Test Generation

Automatically generates comprehensive test cases from software requirements using few-shot learning and domain-specific knowledge
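
A minimal sketch of few-shot prompt construction for this step; the worked example and wording are invented for illustration and are not AppEvalPilot's actual prompts.

```python
FEW_SHOT_EXAMPLE = """\
Requirement: A todo app with add/delete buttons and an item counter.
Test cases:
1. Add an item; the list shows it and the counter increments.
2. Delete an item; it disappears and the counter decrements.
"""

def build_test_generation_prompt(requirement: str) -> str:
    # Prepend a worked example so the model imitates the output format.
    return (
        "You write concrete, executable GUI test cases for web apps.\n\n"
        f"{FEW_SHOT_EXAMPLE}\n"
        f"Requirement: {requirement}\n"
        "Test cases:"
    )
```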

GUI Automation

Executes dynamic user interactions through real GUI operations using PyAutoGUI for mouse and keyboard emulation
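
For instance, a single scripted interaction might look like the snippet below; the PyAutoGUI calls are real library functions, but the coordinates and flow are invented for illustration (in practice the agent derives click targets from screenshots).

```python
import pyautogui

pyautogui.FAILSAFE = True  # abort by moving the mouse to a screen corner

def fill_and_submit(x: int, y: int, text: str):
    """Click an input field, type into it, submit, and capture the result."""
    pyautogui.click(x, y)                 # focus the field at (x, y)
    pyautogui.write(text, interval=0.05)  # emulate keystrokes
    pyautogui.press("enter")              # submit the form
    return pyautogui.screenshot()         # PIL image of the resulting state
```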

Real-time Verification

Performs live functional verification and validates software behavior against requirements with adaptive decision-making
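
One way to picture this is an observe-act-check loop like the sketch below; `observe` and `act` are caller-supplied callables standing in for screenshot parsing and GUI actions, so this shows the shape of the idea rather than the system's real control logic.

```python
import time
from typing import Callable

def verify_behaviour(expected: str,
                     observe: Callable[[], str],
                     act: Callable[[str], None],
                     max_steps: int = 5) -> bool:
    """Repeatedly observe the app, act if the expected behaviour is not
    yet visible, and report whether it appeared within max_steps."""
    for _ in range(max_steps):
        state = observe()
        if expected in state:   # requirement satisfied
            return True
        act(state)              # adapt: choose the next action from the state
        time.sleep(0.5)         # give the UI time to settle
    return False
```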

Human-aligned Assessment

Provides accurate evaluations with 92% accuracy and 85% correlation to expert human judgments
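
The two headline numbers can be reproduced from per-task scores along these lines; this is a sketch of the standard definitions (pass/fail decision accuracy and Pearson correlation), not the paper's exact protocol.

```python
import numpy as np

def agreement_metrics(agent_scores, human_scores, threshold=0.5):
    """Pass/fail accuracy and Pearson correlation against human scores."""
    a = np.asarray(agent_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    accuracy = float(np.mean((a >= threshold) == (h >= threshold)))
    correlation = float(np.corrcoef(a, h)[0, 1])  # Pearson r
    return accuracy, correlation
```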

Case Studies

Expense Planner

An intelligent personal finance tool that helps users track spending, categorize expenses, and set monthly budgets with automatic analytics.

Festival Planner

An event-management application for organizing festival schedules and managing performers, venues, and ticketing information interactively.

Language Spelling Bee

A language-learning quiz platform that offers interactive spelling challenges, real-time feedback, and progress tracking across difficulty levels.
