🪲 CTFBench 🚩

About CTFBench

CTFBench is a benchmark for evaluating AI smart contract auditors. It uses a set of test cases in which each smart contract contains exactly one known vulnerability. The benchmark calculates two key metrics, reported in the results table as VDR and OI.

By plotting these metrics on a graph, CTFBench allows for a visual comparison of different AI auditors, helping developers and researchers assess their effectiveness and trade-offs.

Benchmark Results for AI Smart Contract Auditors

Name                   VDR     OI
SavantChat Dec 2025    1.000   0.005
gpt_5.5                1.000   0.098
gemini_3.1_pro         0.984   0.096
SavantChat May 2025    0.952   0.027
claude_opus_4.6        0.889   0.101
SavantChat Mar 2025    0.857   0.033
kimi_k2.6              0.825   0.084
claude_opus_4.7        0.730   0.202
claude_opus_4.5        0.714   0.121
gpt_5                  0.714   0.111
deepseek_v4_pro        0.648   0.058
gemini_2.5_pro         0.571   0.056
gpt_5.4                0.571   0.106
mimo_v2.5_pro          0.556   0.123
gpt_5.2                0.524   0.109
grok 3 thinking        0.524   0.037
minimax_m2.7           0.508   0.134
ARMUR                  0.524   0.090
openai_o3_mini_high    0.429   0.056
openai_o3_mini         0.429   0.064
deepseek_r1            0.429   0.070
Code Genie AI          0.333   0.023
slither                0.238   0.130
QuillShield            0.143   0.009
Aegis                  0.143   0.118
AuditOne               0.095   0.028
SCAU                   0.000   0.051
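
For readers who want to work with these numbers programmatically, the following is a minimal Python sketch, not part of CTFBench itself. It parses result rows in the "Name VDR OI" layout (names may contain spaces) and keeps the auditors that no other auditor beats on both metrics. The helper names `Result`, `parse_row`, and `pareto_front` are hypothetical, and the sketch assumes that a higher VDR is better and a lower OI is better; only a handful of rows from the table are included for brevity.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    vdr: float
    oi: float


def parse_row(line: str) -> Result:
    # Names may contain spaces, so peel the two numeric columns off the right.
    name, vdr, oi = line.rsplit(maxsplit=2)
    return Result(name, float(vdr), float(oi))


def dominates(a: Result, b: Result) -> bool:
    # Assumption: higher VDR is better, lower OI is better.
    no_worse = a.vdr >= b.vdr and a.oi <= b.oi
    strictly_better = a.vdr > b.vdr or a.oi < b.oi
    return no_worse and strictly_better


def pareto_front(results: list[Result]) -> list[Result]:
    # Keep every auditor that is not dominated by any other auditor.
    return [r for r in results if not any(dominates(o, r) for o in results)]


# A few rows copied from the results table.
ROWS = """\
SavantChat Dec 2025 1.000 0.005
gpt_5.5 1.000 0.098
gemini_3.1_pro 0.984 0.096
SavantChat May 2025 0.952 0.027
grok 3 thinking 0.524 0.037
QuillShield 0.143 0.009
""".splitlines()

results = [parse_row(row) for row in ROWS]
front = pareto_front(results)

for r in front:
    print(f"{r.name}: VDR={r.vdr:.3f}, OI={r.oi:.3f}")
```

Under the stated assumption, an auditor is on the front when no other entry matches or exceeds its VDR while also matching or undercutting its OI. This is one way to make the trade-off comparison explicit without a plot.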

Performance Graph