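CTFBench is a benchmark for evaluating AI smart contract auditors. It uses a set of test cases in which each smart contract contains exactly one known vulnerability. The benchmark calculates two key metrics:

- Vulnerability Detection Rate (VDR): the fraction of known vulnerabilities that the auditor correctly identifies. Higher is better.
- Overreporting Index (OI): the number of false positives the auditor reports per line of code. Lower is better.

Plotting the two metrics against each other gives a visual comparison of different AI auditors, helping developers and researchers assess both their effectiveness and the trade-off between detection and overreporting.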
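As a rough illustration of how two such metrics could be computed from per-contract audit results, here is a minimal sketch. It assumes that VDR is the share of test cases in which the known vulnerability was found and that OI is the total number of false positives divided by the total lines of code. The `AuditResult` type, its field names, and the sample numbers are hypothetical and not taken from CTFBench itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AuditResult:
    """Outcome of one auditor on one test contract (hypothetical structure)."""
    detected: bool        # did the auditor report the single known vulnerability?
    false_positives: int  # reported findings that are not the known vulnerability
    lines_of_code: int    # size of the audited contract


def vulnerability_detection_rate(results: List[AuditResult]) -> float:
    """Fraction of test cases in which the known vulnerability was found."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.detected for r in results) / len(results)


def overreporting_index(results: List[AuditResult]) -> float:
    """False positives per line of code, pooled over all test cases (assumed)."""
    total_loc = sum(r.lines_of_code for r in results)
    if total_loc == 0:
        raise ValueError("no lines of code to normalize by")
    return sum(r.false_positives for r in results) / total_loc


# Made-up sample data, for illustration only.
sample = [
    AuditResult(detected=True, false_positives=1, lines_of_code=50),
    AuditResult(detected=False, false_positives=3, lines_of_code=150),
]
print(vulnerability_detection_rate(sample))  # 0.5
print(overreporting_index(sample))           # 0.02
```

Pooling false positives over all contracts is one possible normalization. Averaging a per-contract ratio would weight small contracts more heavily and could give a different OI.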
| Name | VDR | OI |
|---|---|---|
| savant.chat v0.2 | 0.952 | 0.027 |
| savant.chat v0.1 | 0.857 | 0.033 |
| grok 3 thinking | 0.524 | 0.037 |
| ARMUR | 0.524 | 0.090 |
| openai_o3_mini_high | 0.429 | 0.056 |
| openai_o3_mini | 0.429 | 0.064 |
| deepseek_r1 | 0.429 | 0.070 |
| Code Genie AI | 0.333 | 0.023 |
| slither | 0.238 | 0.130 |
| QuillShield | 0.143 | 0.009 |
| Aegis | 0.143 | 0.118 |
| AuditOne | 0.095 | 0.028 |
| SCAU | 0.000 | 0.051 |
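
One way to read the trade-off in the table above is to ask which auditors are not beaten on both metrics at once. The sketch below assumes that a higher VDR and a lower OI are both better, and lists the auditors that no other entry dominates. The helper names `dominates` and `pareto_front` are illustrative and not part of CTFBench.

```python
# (name, VDR, OI) rows copied from the table above.
RESULTS = [
    ("savant.chat v0.2", 0.952, 0.027),
    ("savant.chat v0.1", 0.857, 0.033),
    ("grok 3 thinking", 0.524, 0.037),
    ("ARMUR", 0.524, 0.090),
    ("openai_o3_mini_high", 0.429, 0.056),
    ("openai_o3_mini", 0.429, 0.064),
    ("deepseek_r1", 0.429, 0.070),
    ("Code Genie AI", 0.333, 0.023),
    ("slither", 0.238, 0.130),
    ("QuillShield", 0.143, 0.009),
    ("Aegis", 0.143, 0.118),
    ("AuditOne", 0.095, 0.028),
    ("SCAU", 0.000, 0.051),
]


def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one."""
    _, vdr_a, oi_a = a
    _, vdr_b, oi_b = b
    return vdr_a >= vdr_b and oi_a <= oi_b and (vdr_a > vdr_b or oi_a < oi_b)


def pareto_front(rows):
    """Names of rows that no other row dominates, in table order."""
    return [r[0] for r in rows if not any(dominates(other, r) for other in rows)]


print(pareto_front(RESULTS))
# ['savant.chat v0.2', 'Code Genie AI', 'QuillShield']
```

Every other entry in the table is matched or beaten on both metrics by at least one of these three.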