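CTFBench is a benchmark for evaluating AI smart contract auditors. It uses a set of test cases in which each smart contract contains exactly one known vulnerability. The benchmark calculates two key metrics: the Vulnerability Detection Rate (VDR), the share of known vulnerabilities an auditor finds, and the Overreporting Index (OI), which captures how much an auditor reports beyond real issues.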
By plotting these metrics on a graph, CTFBench allows for a visual comparison of different AI auditors, helping developers and researchers assess their effectiveness and trade-offs.
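
To make the two metrics concrete, here is a minimal Python sketch of how they could be computed from per-contract results. The `CaseResult` record and its field names are hypothetical, and the OI formula (false positives per line of audited code) is an assumption, since the text above does not spell out the exact definition.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one auditor run on one test contract (hypothetical record)."""
    found_known_vuln: bool  # did the report include the known vulnerability?
    false_positives: int    # reported findings that match no real issue
    lines_of_code: int      # size of the audited contract


def vdr(results):
    """Fraction of test cases in which the known vulnerability was detected."""
    return sum(r.found_known_vuln for r in results) / len(results)


def oi(results):
    """ASSUMED definition: total false positives per line of code audited."""
    total_fp = sum(r.false_positives for r in results)
    total_loc = sum(r.lines_of_code for r in results)
    return total_fp / total_loc


# Made-up example data, not taken from the benchmark.
results = [
    CaseResult(True, 2, 100),
    CaseResult(False, 5, 200),
    CaseResult(True, 0, 100),
    CaseResult(True, 1, 100),
]

print(f"VDR = {vdr(results):.3f}")  # 3 of 4 cases detected
print(f"OI  = {oi(results):.3f}")   # 8 false positives over 500 lines
```

Under these assumptions a higher VDR and a lower OI are both better, which is what makes a two-axis plot a natural way to compare auditors.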
Name | VDR | OI |
---|---|---|
savant.chat v0.2 | 0.952 | 0.027 |
savant.chat v0.1 | 0.857 | 0.033 |
grok 3 thinking | 0.524 | 0.037 |
ARMUR | 0.524 | 0.090 |
openai_o3_mini_high | 0.429 | 0.056 |
openai_o3_mini | 0.429 | 0.064 |
deepseek_r1 | 0.429 | 0.070 |
Code Genie AI | 0.333 | 0.023 |
slither | 0.238 | 0.130 |
QuillShield | 0.143 | 0.009 |
Aegis | 0.143 | 0.118 |
AuditOne | 0.095 | 0.028 |
SCAU | 0.000 | 0.051 |
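
One way to read the trade-offs in this table without a plot is to look for auditors that no other auditor beats on both metrics at once. The sketch below copies the rows from the table and filters them this way; it assumes that higher VDR and lower OI are better, and the helper names `dominates` and `pareto_front` are illustrative, not part of CTFBench.

```python
# (name, VDR, OI) rows copied from the table above.
ROWS = [
    ("savant.chat v0.2", 0.952, 0.027),
    ("savant.chat v0.1", 0.857, 0.033),
    ("grok 3 thinking", 0.524, 0.037),
    ("ARMUR", 0.524, 0.090),
    ("openai_o3_mini_high", 0.429, 0.056),
    ("openai_o3_mini", 0.429, 0.064),
    ("deepseek_r1", 0.429, 0.070),
    ("Code Genie AI", 0.333, 0.023),
    ("slither", 0.238, 0.130),
    ("QuillShield", 0.143, 0.009),
    ("Aegis", 0.143, 0.118),
    ("AuditOne", 0.095, 0.028),
    ("SCAU", 0.000, 0.051),
]


def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one.

    Assumes higher VDR is better and lower OI is better.
    """
    _, vdr_a, oi_a = a
    _, vdr_b, oi_b = b
    no_worse = vdr_a >= vdr_b and oi_a <= oi_b
    strictly_better = vdr_a > vdr_b or oi_a < oi_b
    return no_worse and strictly_better


def pareto_front(rows):
    """Rows that are not dominated by any other row."""
    return [r for r in rows if not any(dominates(other, r) for other in rows)]


for name, vdr_value, oi_value in pareto_front(ROWS):
    print(f"{name}: VDR={vdr_value:.3f}, OI={oi_value:.3f}")
```

With the figures above, the filter keeps savant.chat v0.2, Code Genie AI, and QuillShield: the first has the highest detection rate, while the other two report fewer spurious findings at the cost of a lower detection rate.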