One of the aims of SecurEth and the Security Community interviews has been to create a list of common resources for the community. The ideal would be to have a benchmark of automated tooling for developers to use before they come to a formal audit. If you have tools that you love or tools that you wish were better, let’s talk about them. We are also hoping to find missing tools or promising projects that need a little extra support to finish needed work.
Another topic along this vein is the standards and level of testing that tools undergo. Automated tooling is awesome, and a great way to make the burden of developing secure smart contracts manageable. However, since these tools affect production code that will touch funds, they need to be developed in a way that errs on the side of caution.
Taking away or obscuring where a potential issue could be can make a tool actively harmful. Following security guidelines for the tools themselves will help increase the quality of toolsets, and perhaps we should be tracking the readiness levels of different tools in the tool chain (see TRL from NASA).
What do you all think?
It would be fantastic to have a standardized set of benchmarks to measure tools against. One of our researchers (Suhabe) has developed this here:
The test suite is still incomplete, though. It would be very helpful if we could reach consensus on a complete (ever-evolving) set of inputs and expected results.
Note: We have used this benchmark to improve Mythril so obviously it performs well on the test data (it’s not meant to be a jab at other tools - Manticore also does well out-of-the-box).
That’s a fantastic point. A test suite covering all the different vulnerabilities and all their potential permutations would be great to test against for a given class of tools. Collaborative test suites = awesome!
Hey @captnseagraves, that sounds like a great idea! We also built a standardized benchmark to measure tools, with a focus on full contract implementations to best represent real-world usage.
We also maintain a corresponding benchmark for C/C++ and binary analysis tools, the DARPA Challenge Sets. These are ~200 real applications with known vulnerabilities in them.
You can see all the benefits that a standard benchmark like these provide by reading our blog post about it: https://blog.trailofbits.com/2016/08/01/your-tool-works-better-than-mine-prove-it/
Idea: Fuzz-based gas profiler for functions. Identify performance issues and potentially logical snags.
Not sure fuzz testing is necessary to find gas-consumption issues. solc has a --gas flag that prints an estimate of gas consumption for each external function (or it will say infinite if the compiler thinks it is unbounded).
I think “infinite” is the first part of the proof of why it’s necessary. Fuzzing could generate all the feasible execution paths of the code; those paths could then be analyzed for their gas costs and reduced to the set of unique gas costs, giving a statistical profile of gas usage per function. You probably wouldn’t do much with it unless the standard deviation was significantly out of whack, or a series of inputs produced seemingly uncorrelated gas usage, which might point to an exploit or bug.
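To make that reduction concrete, here is a minimal Python sketch, assuming the fuzzer hands back a list of (input, gas_used) pairs per function (the `gas_profile` name and the sample data are hypothetical, not part of any real tool):

```python
import statistics

def gas_profile(samples):
    """Reduce fuzzer samples for one function -- a list of
    (input, gas_used) pairs -- to the set of unique gas costs
    plus summary statistics for later analysis."""
    costs = [gas for _, gas in samples]
    return {
        "unique_costs": sorted(set(costs)),
        "min": min(costs),
        "max": max(costs),
        "mean": statistics.mean(costs),
        "stdev": statistics.pstdev(costs),
    }

# Hypothetical fuzz output for a single external function
samples = [("0x00", 21000), ("0x01", 21000), ("0xff", 47000)]
profile = gas_profile(samples)
```

Two inputs sharing one cost and a third with a very different cost would already show up here as two unique costs with a large spread.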
This is pretty easy to do with Echidna. I wrote https://github.com/trailofbits/echidna/pull/86 in about 45 minutes, from opening an editor to committing in git. It should be pretty easy to extend to call sequences, more metrics, or whatever fancy analysis would be useful to people.
Awesome. Can you do standard deviation? I.e., paths outside one standard deviation might be highlighted as interesting. They basically stake out what the envelope looks like for the function arguments. By looking at the corner cases of the envelope, you can discover interesting things about behavior: things that might not raise errors in analysis but are large state changes or unexpected external calls.
Haha, nice. I saw that in your Slack and thought it was funny timing given @fubuloubu’s comment.
I don’t think standard deviation reveals anything that maximum/minimum doesn’t from a security perspective. If there are pathologically complex paths, then the worst example is the one we care about and the most illustrative case. However, the results list is just pairs of inputs and the gas they consume, so if you’re particularly interested in standard deviation it should be pretty easy to add.
I was thinking that certain groupings of execution paths might naturally form (say, for different-sized string inputs), but an outlier significantly (more than one standard deviation) outside of that grouping might have special characteristics that require investigation (e.g., a certain string input makes an external call).
This is way too hypothetical, though; min/max is probably good enough to surface interesting behaviors. The idea is to find abnormal behavior through gas usage.
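For what it’s worth, the outlier idea is only a few lines given that results list of (input, gas) pairs. A hedged sketch (the `flag_gas_outliers` helper and the sample data are hypothetical, not from Echidna):

```python
import statistics

def flag_gas_outliers(results, n_stdev=1.0):
    """Flag (input, gas) pairs whose gas cost is more than n_stdev
    standard deviations from the mean -- candidates for manual review."""
    costs = [gas for _, gas in results]
    mean = statistics.mean(costs)
    stdev = statistics.pstdev(costs)
    if stdev == 0:
        return []  # every path costs the same; nothing stands out
    return [(inp, gas) for inp, gas in results
            if abs(gas - mean) > n_stdev * stdev]

# Hypothetical example: one string input costs far more than the rest,
# e.g. because it triggers an external call
results = [("a", 21000), ("ab", 21100), ("abc", 21050), ("xyz", 90000)]
print(flag_gas_outliers(results))  # only the 90000-gas run is flagged
```

Whether one standard deviation is the right threshold would depend on how tightly the gas costs cluster in practice.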