Code Arena | WebDev

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use

Feb 18, 2026

162,809 votes

43 models

Rank	Rank Spread	Model	Organization	License	Score	Votes
1	1-2	claude-opus-4-6	Anthropic	Proprietary	1560 +14/-14	2,481
2	1-2	claude-opus-4-6-thinking	Anthropic	Proprietary	1551 +16/-16	1,876
3	3-3	claude-opus-4-5-20251101-thinking-32k	Anthropic	Proprietary	1500 +8/-8	10,556
4	4-8	gpt-5.2-high	OpenAI	Proprietary	1471 +16/-16	1,695
5	4-7	claude-opus-4-5-20251101	Anthropic	Proprietary	1468 +8/-8	10,675
6	4-12	gemini-3.1-pro-preview	Google	Proprietary	1461 +15/-15	1,829
7	4-12	glm-5	Z.ai	MIT	1455 +14/-14	2,202
8	5-13	minimax-m2.5	MiniMax	Modified MIT	1444 +12/-12	3,193
9	6-13	gemini-3-pro	Google	Proprietary	1444 +7/-7	16,609
10	6-13	gemini-3-flash	Google	Proprietary	1440 +8/-8	12,281
11	6-13	glm-4.7	Z.ai	MIT	1439 +10/-10	5,127
12	6-13	kimi-k2.5-thinking	Moonshot	Modified MIT	1437 +11/-11	3,512
13	8-13	kimi-k2.5-instant	Moonshot	Modified MIT	1424 +13/-13	2,432
14	14-20	gemini-3-flash (thinking-minimal)	Google	Proprietary	1402 +8/-8	8,322
15	14-20	minimax-m2.1-preview	MiniMax	MIT	1402 +8/-8	9,469
16	14-22	gpt-5.2	OpenAI	Proprietary	1395 +16/-16	1,634
17	14-22	gpt-5-medium	OpenAI	Proprietary	1393 +12/-12	3,928
18	14-21	claude-sonnet-4-5-20250929-thinking-32k	Anthropic	Proprietary	1390 +7/-7	13,766
19	14-22	claude-opus-4-1-20250805	Anthropic	Proprietary	1388 +8/-8	8,985
20	14-22	gpt-5.1-medium	OpenAI	Proprietary	1387 +9/-9	6,437
21	16-22	claude-sonnet-4-5-20250929	Anthropic	Proprietary	1386 +7/-7	15,421
22	17-23	deepseek-v3.2-thinking	DeepSeek	MIT	1372 +9/-9	5,665
23	22-25	glm-4.6	Z.ai	MIT	1356 +8/-8	8,747
24	23-28	gpt-5.1	OpenAI	Proprietary	1342 +7/-7	12,698
25	23-28	mimo-v2-flash (non-thinking)	Xiaomi	MIT	1340 +8/-8	6,607
26	24-28	gpt-5.2-codex	OpenAI	Proprietary	1336 +9/-9	5,318
27	24-29	kimi-k2-thinking-turbo	Moonshot	Modified MIT	1330 +7/-7	12,205
28	24-30	gpt-5.1-codex	OpenAI	Proprietary	1328 +9/-9	6,505
29	27-31	deepseek-v3.2	DeepSeek	MIT	1318 +9/-9	6,945
30	28-31	minimax-m2	MiniMax	Apache 2.0	1312 +9/-9	8,834
31	29-31	claude-haiku-4-5-20251001	Anthropic	Proprietary	1305 +7/-7	13,482
32	32-33	deepseek-v3.2-exp	DeepSeek	MIT	1286 +10/-10	5,131
33	32-33	qwen3-coder-480b-a35b-instruct	Alibaba	Apache 2.0	1282 +7/-7	13,201
34	34-36	KAT-Coder-Pro-V1	KwaiKAT	Proprietary	1258 +15/-15	1,954
35	34-37	gpt-5.1-codex-mini	OpenAI	Proprietary	1242 +17/-17	1,537
36	34-37	grok-4-1-fast-reasoning	xAI	Proprietary	1235 +9/-9	7,127
37	35-40	mistral-large-3	Mistral	Apache 2.0	1222 +20/-20	1,039
38	37-40	gemini-2.5-pro	Google	Proprietary	1205 +13/-13	3,455
39	37-40	grok-4.1-thinking	xAI	Proprietary	1204 +19/-19	1,267
40	37-40	devstral-2	Mistral	Modified MIT	1198 +16/-16	1,683
41	41-42	grok-4-fast-reasoning	xAI	Proprietary	1153 +22/-22	968
42	41-43	grok-code-fast-1	xAI	Proprietary	1140 +21/-21	1,017
43	42-43	devstral-medium-2507	Mistral	Proprietary	1099 +22/-22	1,021

Leaderboard Plots

Fraction of Model A Wins for All Non-tied A vs. B Battles

Confidence Intervals on Model Strength (via Bootstrapping)
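
The page says only that the intervals come from bootstrapping; it does not give the procedure or the rating model. As a rough sketch under assumptions, the following resamples one model's non-tied battles with replacement and reports a percentile interval on its win rate. The plain win rate is a stand-in for "model strength", and the `(model_a, model_b, winner)` record layout and the function name are hypothetical, not taken from the leaderboard.

```python
import random

def bootstrap_win_rate_ci(battles, model, rounds=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for `model`'s win rate.

    `battles` is a list of (model_a, model_b, winner) tuples; this layout
    is an assumption made for illustration. Ties (winner is neither
    participant) are dropped before resampling.
    """
    rng = random.Random(seed)
    # 1 for a win, 0 for a loss, over non-tied battles involving `model`.
    outcomes = [
        1 if winner == model else 0
        for a, b, winner in battles
        if model in (a, b) and winner in (a, b)
    ]
    if not outcomes:
        raise ValueError(f"no non-tied battles found for {model!r}")
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample n outcomes with replacement, `rounds` times, and sort.
    estimates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(rounds)
    )
    lower = estimates[int((alpha / 2) * rounds)]
    upper = estimates[int((1 - alpha / 2) * rounds) - 1]
    return point, lower, upper

# Toy data: "m1" wins 30 of 40 non-tied battles against "m2".
toy_battles = [("m1", "m2", "m1")] * 30 + [("m1", "m2", "m2")] * 10
point, lower, upper = bootstrap_win_rate_ci(toy_battles, "m1")
print(f"win rate {point:.2f}, interval [{lower:.2f}, {upper:.2f}]")
```

More votes tighten the interval, which is consistent with the table, where models with around 1,000 votes show spreads near ±20 and models with over 10,000 votes show spreads near ±7 or ±8.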

Battle Count for Each Combination of Models (without Ties)

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)
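
The win-fraction and average-win-rate plots can be read as two steps of one calculation. As a minimal sketch, assuming battles are stored as `(model_a, model_b, winner)` tuples with `winner` set to `None` for a tie (a hypothetical layout, not the leaderboard's own schema), the pairwise fraction and the uniformly weighted average might be computed like this:

```python
from collections import defaultdict

def pairwise_win_fraction(battles):
    """Map (a, b) -> fraction of non-tied a-vs-b battles won by a."""
    wins = defaultdict(int)
    for model_a, model_b, winner in battles:
        if winner not in (model_a, model_b):
            continue  # skip ties
        loser = model_b if winner == model_a else model_a
        wins[(winner, loser)] += 1
        wins[(loser, winner)] += 0  # ensure the reverse pair exists
    return {(a, b): w / (w + wins[(b, a)]) for (a, b), w in wins.items()}

def average_win_rate(fractions):
    """Average each model's win fraction over its opponents.

    Every opponent counts equally regardless of how many battles were
    played against it, which is one reading of "uniform sampling".
    """
    per_model = defaultdict(list)
    for (model, _opponent), fraction in fractions.items():
        per_model[model].append(fraction)
    return {m: sum(fs) / len(fs) for m, fs in per_model.items()}

# Toy data with hypothetical model names.
toy = [
    ("x", "y", "x"),
    ("x", "y", "x"),
    ("x", "y", "y"),
    ("x", "z", "z"),
    ("y", "z", None),  # tie, ignored
]
fractions = pairwise_win_fraction(toy)
averages = average_win_rate(fractions)
print(fractions)
print(averages)
```

Weighting opponents equally keeps a model's average from being dominated by whichever pairing happened to collect the most votes.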