CTFBench is a benchmark for evaluating AI smart contract auditors. It uses a set of test cases in which each smart contract contains exactly one known vulnerability. The benchmark computes two key metrics:

- **VDR (Vulnerability Detection Rate):** the fraction of contracts in which the auditor correctly identifies the planted vulnerability (higher is better).
- **OI (Overreporting Index):** the rate at which the auditor reports spurious findings, i.e. issues other than the planted vulnerability (lower is better).
By plotting these metrics against each other, CTFBench gives a visual comparison of AI auditors, helping developers and researchers assess the trade-off between detection and overreporting.
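As a rough sketch of how the two metrics could be computed from per-contract audit results (the record format, and the normalization used for OI, are illustrative assumptions, not CTFBench's exact formulas):

```python
# Hypothetical per-contract results: whether the single planted vulnerability
# was found, and how many unrelated (spurious) issues were reported.
results = [
    {"found_planted": True,  "extra_findings": 0},
    {"found_planted": True,  "extra_findings": 2},
    {"found_planted": False, "extra_findings": 1},
]

def vdr(results):
    """Vulnerability Detection Rate: share of contracts whose planted bug was found."""
    return sum(r["found_planted"] for r in results) / len(results)

def oi(results, findings_budget=10):
    """Overreporting Index (sketch): spurious findings, normalized per contract.

    The `findings_budget` normalizer is an assumption made for illustration.
    """
    total_extra = sum(r["extra_findings"] for r in results)
    return total_extra / (len(results) * findings_budget)

print(round(vdr(results), 3))  # 0.667
print(round(oi(results), 3))   # 0.1
```

An ideal auditor sits at the top-left of the VDR/OI plane: VDR close to 1.0 with OI close to 0.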
| Name | VDR ↑ | OI ↓ |
|---|---|---|
| SavantChat Dec 2025 | 1.000 | 0.005 |
| gpt_5.5 | 1.000 | 0.098 |
| gemini_3.1_pro | 0.984 | 0.096 |
| SavantChat May 2025 | 0.952 | 0.027 |
| claude_opus_4.6 | 0.889 | 0.101 |
| SavantChat Mar 2025 | 0.857 | 0.033 |
| kimi_k2.6 | 0.825 | 0.084 |
| claude_opus_4.7 | 0.730 | 0.202 |
| claude_opus_4.5 | 0.714 | 0.121 |
| gpt_5 | 0.714 | 0.111 |
| deepseek_v4_pro | 0.648 | 0.058 |
| gemini_2.5_pro | 0.571 | 0.056 |
| gpt_5.4 | 0.571 | 0.106 |
| mimo_v2.5_pro | 0.556 | 0.123 |
| gpt_5.2 | 0.524 | 0.109 |
| grok 3 thinking | 0.524 | 0.037 |
| ARMUR | 0.524 | 0.090 |
| minimax_m2.7 | 0.508 | 0.134 |
| openai_o3_mini_high | 0.429 | 0.056 |
| openai_o3_mini | 0.429 | 0.064 |
| deepseek_r1 | 0.429 | 0.070 |
| Code Genie AI | 0.333 | 0.023 |
| slither | 0.238 | 0.130 |
| QuillShield | 0.143 | 0.009 |
| Aegis | 0.143 | 0.118 |
| AuditOne | 0.095 | 0.028 |
| SCAU | 0.000 | 0.051 |