I saw a discussion about the Consensys “evm-analyzer-benchmark-suite” come up in Telegram, and I thought I would write out my comments on the forum.
We think that the Consensys benchmark is a good first step toward improving the consistency and accuracy of security tools in the Ethereum ecosystem. Most tools have no hope of operating correctly if they cannot pass this benchmark. However, we need to set the bar much higher to encourage progress in the field, and I would recommend a portfolio approach that combines different benchmarking strategies to do so.
This is a path we have been down before. In the compiled software community, the earliest effort at benchmarking static analysis tools, Juliet, used an approach similar to the Consensys one. Juliet contains thousands of microbenchmarks for C, C++, and Java. In an attempt to set a higher bar, the DoD then created the STONESOUP test suite, which inserted new vulnerabilities into open source software. This worked, but the evaluated tools frequently discovered previously unknown flaws in the software that the test suite had not intended. Finally, DARPA funded the development of their Challenge Sets in 2014. This test suite contains real, complex programs with identified vulnerabilities, patches, exploit triggers, and more.
We started to emulate the DARPA Challenge Sets approach and created Not So Smart Contracts, a collection of real smart contract code with known vulnerabilities. Admittedly, working on this repository has not been a primary focus area for our team, and there are many opportunities for improvement. Ultimately, we think it will become the most challenging and comprehensive benchmark for the community as it matures.
In many ways, the gold standard for effective benchmark development is SV-COMP, an annual competition to objectively evaluate software verification tools. Inspired by our own experience and the effectiveness of the SV-COMP benchmarks, we recommend that any security benchmark include a minimum of the following to ensure good outcomes:
- Each test should include only a single vulnerability
- Each test should include a human-readable and a machine-readable description
- Each test should include vulnerable source code and a compiled binary
- Each test should include an annotated correct or non-vulnerable variant
In particular, effective benchmarks include the fixed version to assist with false positive detection.
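To make the checklist above concrete, here is a minimal Python sketch of how a single test case could be described and scored. It is only an illustration under my own assumptions: `BenchmarkCase`, `missing_fields`, `score`, the field names, and the file paths are all hypothetical and do not reflect the layout of the Consensys suite, Not So Smart Contracts, or SV-COMP.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Set


@dataclass(frozen=True)
class BenchmarkCase:
    """One test case; every field name here is a hypothetical choice."""
    name: str
    description: str          # human-readable explanation of the flaw
    vulnerability_class: str  # machine-readable label; exactly one per test
    vulnerable_source: str    # path to the vulnerable source code
    vulnerable_binary: str    # path to the compiled vulnerable bytecode
    fixed_source: str         # path to the annotated non-vulnerable variant
    fixed_binary: str         # path to the compiled fixed bytecode


def missing_fields(case: BenchmarkCase) -> List[str]:
    """Return the names of any fields that were left empty."""
    return [f.name for f in fields(case) if not getattr(case, f.name)]


def score(case: BenchmarkCase,
          findings_on_vulnerable: Set[str],
          findings_on_fixed: Set[str]) -> Dict[str, bool]:
    """Compare a tool's reported classes on both variants of a test.

    A report on the vulnerable variant counts as a detection; the same
    report on the fixed variant counts as a false positive.
    """
    label = case.vulnerability_class
    return {
        "detected": label in findings_on_vulnerable,
        "false_positive": label in findings_on_fixed,
    }


# Hypothetical example case; the paths are placeholders, not real files.
example = BenchmarkCase(
    name="reentrancy-basic",
    description="External call made before the balance is updated.",
    vulnerability_class="reentrancy",
    vulnerable_source="reentrancy/Vulnerable.sol",
    vulnerable_binary="reentrancy/Vulnerable.bin",
    fixed_source="reentrancy/Fixed.sol",
    fixed_binary="reentrancy/Fixed.bin",
)

print(missing_fields(example))
print(score(example, {"reentrancy"}, {"reentrancy"}))
```

In this sketch, a tool that reports the issue on both variants is credited with a detection but also charged with a false positive, which is exactly the signal that a benchmark without a fixed variant cannot provide.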
We’ll be focusing on adding these features to our Not So Smart Contracts repository as we go, and we would encourage Consensys to do the same with theirs.