CTFBench is a benchmark for evaluating AI smart contract auditors. It uses a set of test cases in which each smart contract contains exactly one known vulnerability. The benchmark computes two key metrics:

- **VDR (Vulnerability Detection Rate):** the fraction of contracts in which the auditor correctly identifies the planted vulnerability (higher is better).
- **OI (Overreporting Index):** the rate at which the auditor reports spurious findings, i.e. issues other than the planted vulnerability (lower is better).
By plotting these metrics against each other, CTFBench gives a visual comparison of AI auditors, helping developers and researchers assess the trade-off between detection and overreporting.
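As a rough sketch of how the two metrics could be computed from per-contract audit results (the record format, and the normalization used for OI, are illustrative assumptions, not CTFBench's exact formulas):

```python
# Hypothetical per-contract results: whether the single planted vulnerability
# was found, and how many unrelated (spurious) issues were reported.
results = [
    {"found_planted": True,  "extra_findings": 0},
    {"found_planted": True,  "extra_findings": 2},
    {"found_planted": False, "extra_findings": 1},
]

def vdr(results):
    """Vulnerability Detection Rate: share of contracts whose planted bug was found."""
    return sum(r["found_planted"] for r in results) / len(results)

def oi(results, findings_budget=10):
    """Overreporting Index (sketch): spurious findings, normalized per contract.

    The `findings_budget` normalizer is an assumption made for illustration.
    """
    total_extra = sum(r["extra_findings"] for r in results)
    return total_extra / (len(results) * findings_budget)

print(round(vdr(results), 3))  # 0.667
print(round(oi(results), 3))   # 0.1
```

An ideal auditor sits at the top-left of the VDR/OI plane: VDR close to 1.0 with OI close to 0.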
| Name | VDR ↑ | OI ↓ |
|---|---|---|
| SavantChat Dec 2025 | 1.000 | 0.005 |
| gpt_5.5 | 1.000 | 0.098 |
| gemini_3.1_pro | 0.984 | 0.096 |
| SavantChat May 2025 | 0.952 | 0.027 |
| claude_opus_4.6 | 0.889 | 0.101 |
| SavantChat Mar 2025 | 0.857 | 0.033 |
| kimi_k2.6 | 0.825 | 0.084 |
| claude_opus_4.7 | 0.730 | 0.202 |
| claude_opus_4.5 | 0.714 | 0.121 |
| gpt_5 | 0.714 | 0.111 |
| deepseek_v4_pro | 0.648 | 0.058 |
| gemini_2.5_pro | 0.571 | 0.056 |
| gpt_5.4 | 0.571 | 0.106 |
| mimo_v2.5_pro | 0.556 | 0.123 |
| gpt_5.2 | 0.524 | 0.109 |
| grok 3 thinking | 0.524 | 0.037 |
| ARMUR | 0.524 | 0.090 |
| minimax_m2.7 | 0.508 | 0.134 |
| openai_o3_mini_high | 0.429 | 0.056 |
| openai_o3_mini | 0.429 | 0.064 |
| deepseek_r1 | 0.429 | 0.070 |
| Code Genie AI | 0.333 | 0.023 |
| slither | 0.238 | 0.130 |
| QuillShield | 0.143 | 0.009 |
| Aegis | 0.143 | 0.118 |
| AuditOne | 0.095 | 0.028 |
| SCAU | 0.000 | 0.051 |