Designing for failure
We run a newsletter that is sent roughly once a week, with additional commentary, news about our upcoming open-source projects, and things happening at the company. You can sign up at www.crashoverride.com.
Discussion and comments about this article can be found on LinkedIn here.
There is an appsec trend that I must confess has totally flown under my radar, and the more I look at it, the more I realize how important it is. We have two free open-source projects coming that are related to this observation. This is how I came to realize what I had been missing, so consider that I may have a bias.
Way back when, I was a consulting manager at ISS (Internet Security Systems). ISS was a leader in intrusion detection systems with its RealSecure product. My team did a lot of the deployments after the product had been sold. What was true back then, and in talking to people is still true today, is that while tools like IDSs and WAFs have blocking modes that can protect against attacks, almost nobody uses them. I remember one of the FireEye guys once telling me that you had to have IPS as a feature in order to sell your tool, but it was never needed to deploy or operate it. Said another way, people want protection in order to buy, but they don't need it in order to operate.
I suspect this is likely the result of compounded industry conditioning that "an ounce of prevention is better than a pound of cure" and of trying to see around the corner, but the reality is that false positives, false negatives, and PON (Plain Old Noise) all tell people that automated protection is just not ready or practically possible today. I have certainly heard phrases like "you don't want to stick something between you and your users that will inevitably go wrong and stop them using your system, do you?". Maybe the promise of application intrusion prevention was always jumping the shark?
The devops world seems to have taken a different approach: rather than chasing the holy grail of a system that automatically protects itself from failures out of the box, it embraced observability as a principle. The three core tenets of observability (logs, metrics, and traces) are used to design resilient systems based on observing real-world behavior. Observability is so deeply rooted in devops culture that Site Reliability Engineer (SRE) is a key role on any cloud team. The engineers at SourceClear used to eat, sleep, and breathe Sentry alerts in the Slack channel. Many things were fixed before any customer ever noticed.
If we look at the approach many people took to the Log4Shell vulnerability, most went into 'scramble mode', trying to find out where it was in production. The standard approach was to look at the results of their SCA tools. They inevitably then tried to filter the list of 'critical alerts' from the thousands of repos to see what was actually in production, only to realize that they didn't have the data. A CSO friend told me the other day that he has 6,300 repos and no clue how many are hackathon projects or copies of other applications. When teams eventually got the data, usually highly manually (a problem that our upcoming open-source project, Chalk, is automating away), they then had to get on the hosts to further filter through the code and file system to see if it was actually being used or not. It's amazing how many test directories with Log4j make it into production.
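That host-level triage step can be sketched in a few lines. This is a minimal illustration rather than how any particular team did it; the jar-naming pattern and the manifest field it reads are assumptions:

```python
import os
import re
import zipfile

def find_log4j_jars(root):
    """Walk a directory tree and report any Log4j core jars found,
    reading the version from the jar's manifest where possible."""
    hits = []
    # Implementation-Version is a common (but not guaranteed) manifest field.
    version_re = re.compile(r"^Implementation-Version:\s*(.+)$", re.MULTILINE)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.startswith("log4j-core") and name.endswith(".jar"):
                path = os.path.join(dirpath, name)
                version = "unknown"
                try:
                    with zipfile.ZipFile(path) as jar:
                        manifest = jar.read("META-INF/MANIFEST.MF").decode("utf-8", "replace")
                        match = version_re.search(manifest)
                        if match:
                            version = match.group(1).strip()
                except (zipfile.BadZipFile, KeyError, OSError):
                    pass  # unreadable or manifest-less jar; report it anyway
                hits.append((path, version))
    return hits
```

During the Log4Shell scramble, this kind of walk (or a shell `find` equivalent) was roughly what 'getting on the hosts' meant, repeated across every production machine.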
Instead of all of the investment in SCA and SAST tools, imagine a scenario where observability got as big an investment as assessment does. If logging, metrics, and tracing for security had been enabled at the application layer, and the data fire-hosed back to an S3 bucket, it would have been a matter of a data query to see where, and in what version, Log4j was actually running in production.
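To make the "it's just a data query" claim concrete, here is a minimal sketch. The record shape and field names are assumptions invented for illustration (one record per deployed artifact, each carrying a small SBOM, as a tool like Chalk might report back):

```python
# Hypothetical telemetry records fire-hosed back from production.
# Every field name here is an assumption for illustration only.
TELEMETRY = [
    {"service": "checkout", "host": "prod-7", "sbom": [
        {"name": "log4j-core", "version": "2.14.1"},
        {"name": "guava", "version": "31.0"}]},
    {"service": "search", "host": "prod-12", "sbom": [
        {"name": "logback-classic", "version": "1.2.6"}]},
]

def find_component(records, component):
    """Return (service, host, version) for every record whose SBOM
    lists the given component: the 'where is Log4j running?' query."""
    return [(r["service"], r["host"], dep["version"])
            for r in records
            for dep in r["sbom"]
            if dep["name"] == component]

print(find_component(TELEMETRY, "log4j-core"))
# -> [('checkout', 'prod-7', '2.14.1')]
```

In practice the same query would run as SQL over the S3 bucket (e.g. with Athena), but the point stands: with the data already collected, the Log4Shell question becomes one lookup instead of weeks of manual host spelunking.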
Designing for failure is such a no-brainer, but I think many appsec programs have been totally focused on designing for success. I know I have been.