CTFBench - Benchmarking AI Smart Contract Auditors

About CTFBench

CTFBench is a benchmark for evaluating AI smart contract auditors. It uses a set of test cases where each smart contract has exactly one known vulnerability. The benchmark calculates two key metrics:

Vulnerability Detection Rate (VDR): The proportion of the known vulnerabilities that the AI auditor successfully identifies.
Overreporting Index (OI): The number of false positives reported per line of code in contracts that are known to be free of vulnerabilities, measuring the auditor's tendency to raise unnecessary alerts.
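
As a rough illustration of the two definitions above, the sketch below computes each metric as a simple ratio. The function names and the example counts are hypothetical, and pooling counts across all contracts (rather than averaging per contract) is an assumption; the text above does not specify how CTFBench aggregates.

```python
def vulnerability_detection_rate(detected: int, total_known: int) -> float:
    """Fraction of known vulnerabilities that the auditor identified."""
    if total_known <= 0:
        raise ValueError("total_known must be positive")
    if not 0 <= detected <= total_known:
        raise ValueError("detected must be between 0 and total_known")
    return detected / total_known


def overreporting_index(false_positives: int, lines_of_code: int) -> float:
    """False positives reported per line of code audited."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    if false_positives < 0:
        raise ValueError("false_positives must be non-negative")
    return false_positives / lines_of_code


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    print(round(vulnerability_detection_rate(detected=20, total_known=21), 3))
    print(round(overreporting_index(false_positives=27, lines_of_code=1000), 3))
```

A higher VDR is better and a lower OI is better, so an ideal auditor sits at VDR = 1 and OI = 0.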

By plotting these metrics on a graph, CTFBench allows for a visual comparison of different AI auditors, helping developers and researchers assess their effectiveness and trade-offs.

Benchmark Results for AI Smart Contract Auditors

Name	VDR	OI
savant.chat v0.2	0.952	0.027
savant.chat v0.1	0.857	0.033
grok 3 thinking	0.524	0.037
ARMUR	0.524	0.090
openai_o3_mini_high	0.429	0.056
openai_o3_mini	0.429	0.064
deepseek_r1	0.429	0.070
Code Genie AI	0.333	0.023
slither	0.238	0.130
QuillShield	0.143	0.009
Aegis	0.143	0.118
AuditOne	0.095	0.028
SCAU	0.000	0.051
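
One way to read the trade-off in the table above is to ask which auditors are not beaten on both metrics at once by some other entry. The sketch below applies a standard Pareto-dominance check to the published rows. This analysis is an illustration and not part of CTFBench itself; the `Result`, `dominates`, and `pareto_front` names are hypothetical.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    vdr: float  # Vulnerability Detection Rate, higher is better
    oi: float   # Overreporting Index, lower is better


# Rows copied from the results table above.
RESULTS = [
    Result("savant.chat v0.2", 0.952, 0.027),
    Result("savant.chat v0.1", 0.857, 0.033),
    Result("grok 3 thinking", 0.524, 0.037),
    Result("ARMUR", 0.524, 0.090),
    Result("openai_o3_mini_high", 0.429, 0.056),
    Result("openai_o3_mini", 0.429, 0.064),
    Result("deepseek_r1", 0.429, 0.070),
    Result("Code Genie AI", 0.333, 0.023),
    Result("slither", 0.238, 0.130),
    Result("QuillShield", 0.143, 0.009),
    Result("Aegis", 0.143, 0.118),
    Result("AuditOne", 0.095, 0.028),
    Result("SCAU", 0.000, 0.051),
]


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    no_worse = a.vdr >= b.vdr and a.oi <= b.oi
    strictly_better = a.vdr > b.vdr or a.oi < b.oi
    return no_worse and strictly_better


def pareto_front(results: List[Result]) -> List[Result]:
    """Return the entries that no other entry dominates, in input order."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


if __name__ == "__main__":
    for r in pareto_front(RESULTS):
        print(f"{r.name}\tVDR={r.vdr:.3f}\tOI={r.oi:.3f}")
```

On these rows the non-dominated entries are savant.chat v0.2, Code Genie AI, and QuillShield: the first has the highest VDR, while the other two report fewer false positives per line at the cost of a lower detection rate.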
