Until a week ago, I was pretty much buried up to my eyeballs in software testing—bug hunting, really. This was part of coursework for school, but the prey was real enough, with the stalking grounds including a popular open-source database software implementation as well as a janky version of the card game Dominion. In my years of studying computer science, I’m not sure if I’ve ever been presented with a comparable combination of thrill and drudgery.
At the very least, I wound up with a new appreciation for the oft-misunderstood realm of software testing. It turns out that testing involves so much more than just being annoying to software engineers: it’s an art of intentionally breaking things, of pushing and pushing code until it finally snaps and every fuck-up is revealed in its fucked-up glory.
That’s the thrill. The drudgery is burying yourself in someone else’s code that is probably not as readable or well-documented as it should be, and then coming up with clever and exhaustive tests tailored to that code (fortunately many are or can be automated, but even then).
I’ve been looking for a way into writing about software testing (and debugging/bug hunting) for a little while, and this week’s announcement that the DoD is now shelling out big bucks in bug bounties seems as good a way in as any. Bug hunting may sound like hacker sorcery, but no, not really. It’s just another wing of software engineering.
Bugs, especially bugs that wind up in finished software products or web apps, are usually (always?) really hard to find. They may only manifest once in a million runs of some software, or only in the most extreme edge cases, i.e. those situations so unlikely to occur in the execution of some piece of code that they are almost but not quite impossible. This means we might have to run the software a million times or more to catch a single bug.
But, if something bad is going to happen only once in a million runs, why are we even bothering? Because once in a million is in many cases a near-promise that something bad is in fact going to happen, maybe even many times. If our software is running, say, on one of the many, many computers under the hood of a modern car, that one in a million might be an airbag randomly deploying on a driver as they navigate rush-hour traffic. So, we have no choice but to look carefully and exhaustively at that one event.
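To put a rough number on that near-promise, here is a minimal sketch in Python. It assumes every run fails independently with the same probability, which real software rarely obliges, and the function name is made up for illustration.

```python
def chance_of_at_least_one_failure(p_fail: float, runs: int) -> float:
    """Probability that a bug with per-run probability p_fail shows up
    at least once in `runs` runs, assuming the runs are independent."""
    return 1.0 - (1.0 - p_fail) ** runs


p = 1e-6  # the "once in a million" bug
for runs in (1_000, 1_000_000, 10_000_000):
    chance = chance_of_at_least_one_failure(p, runs)
    print(f"{runs:>12,} runs: {chance:.4f}")
```

Under those assumptions, a million runs already gives roughly a 63 percent chance of hitting the bug at least once, and ten million runs makes it all but certain. Spread that across a fleet of cars starting up every morning and the one-in-a-million event stops looking rare.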
Maybe you can see how this quickly gets tedious.
We start simply. The code is tested a few times with hand-picked inputs, just to see what happens. Probably, it executes as it should. This is called manual testing.
Next, things escalate as we go from manually testing some code to writing our own automated tests. Now it gets tricky. Our task is to provide as much coverage as possible in our tests, which means that we need to (ideally) ensure that every line of code in the program is executed. This means following the program through its every possible branch and deviation and error condition, however unlikely it is that such an execution will happen IRL.
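As a small illustration of what chasing coverage looks like, here is a sketch in Python. The function under test and its prices are entirely hypothetical; the point is only that each test input is picked to force execution down a different branch, including the error branch that should almost never fire in practice.

```python
def shipping_cost(weight_kg: float, express: bool) -> float:
    """Hypothetical function under test; the rules and prices are made up."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")  # error branch
    if weight_kg < 2:
        cost = 5.0                                   # light-parcel branch
    else:
        cost = 5.0 + 1.5 * (weight_kg - 2)           # heavy-parcel branch
    if express:
        cost *= 2                                    # express branch
    return cost


def run_tests() -> int:
    """Hit every branch at least once; returns the number of checks passed."""
    checks = 0
    assert shipping_cost(1.0, express=False) == 5.0   # light, standard
    checks += 1
    assert shipping_cost(4.0, express=False) == 8.0   # heavy: 5 + 1.5 * 2
    checks += 1
    assert shipping_cost(1.0, express=True) == 10.0   # express doubles it
    checks += 1
    try:
        shipping_cost(0.0, express=False)             # must raise
    except ValueError:
        checks += 1
    else:
        raise AssertionError("expected ValueError for non-positive weight")
    return checks


print(run_tests(), "checks passed")
```

Drop any one of those four checks and some line of the function never runs during testing. On real code, a coverage tool such as coverage.py can report which lines were actually executed, which beats tracing branches by hand.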
To do this, we need to be very careful about what inputs we give the program. If a section of code can conceivably execute, then we need to figure out how to ensure that the program actually reaches that point. An interesting thing is that it might turn out that a particular piece of code within a program will never run, no matter what we do. It just exists as a weird little bubble that maybe the software developer never got around to implementing, or that is maybe an artifact from some earlier version of the software that’s no longer needed. It happens.
As tedious as it can be, an automated testing suite can also be a highly satisfying thing. The point is basically to redline the program (or a discrete subdivision of the program, as in unit testing), and it feels kind of good just staring at the screen and knowing that our tests are at that moment ripping through it at a rate of maybe billions of instructions per second. A short while later and we’ll have some results. Did it fail? Where? Why?
A branch of automated testing is randomized testing, and this is where things get intense. Randomization is how we can really dig into the musty corners of some code. We can make up inputs ourselves forever, but when dealing with millions or more possible combinations, it probably makes more sense to just build a software tester that can itself generate random inputs across the entire allowed ranges of those inputs (and beyond, actually, because we have to make sure the program fails correctly on those bad inputs, right?).
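Here is a minimal sketch of such a tester in Python. The function under test, `letter_grade`, is a hypothetical stand-in, as are the grade cutoffs. The tester draws random inputs from a range deliberately wider than the allowed 0 to 100, and checks that good inputs produce a sane answer while bad inputs fail the way they are supposed to.

```python
import random
from typing import List


def letter_grade(score: int) -> str:
    """Hypothetical function under test: valid scores are 0..100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


def random_test(trials: int, seed: int = 0) -> List[int]:
    """Feed random scores, in range and out of range, to letter_grade.
    Returns every input on which the function misbehaved."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    failures = []
    for _ in range(trials):
        score = rng.randint(-50, 150)  # deliberately wider than 0..100
        should_be_valid = 0 <= score <= 100
        try:
            grade = letter_grade(score)
            ok = should_be_valid and grade in {"A", "B", "C", "D", "F"}
        except ValueError:
            ok = not should_be_valid   # failing correctly counts as a pass
        if not ok:
            failures.append(score)
    return failures


print(random_test(10_000))
```

The seed matters more than it looks: a random tester that finds a bug but can’t reproduce the input that triggered it has only found a rumor.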
As an example, I recently had to figure out how to randomly test a function implemented in Java that is used to determine whether or not a URL is valid. The code doesn’t just get to, like, paste it into Chrome. Instead, the function breaks the URL up into its different constituent parts and then compares each one of those parts against valid patterns, which are themselves derived from the actual URL definition given by the Internet Engineering Task Force circa 1994.
The problem was in coming up with not just random URLs, but random URLs whose validity I knew beforehand. My solution (or partial solution, really) was to just go out and grab a few thousand random but valid URLs from the internet, which I did using a Python script. I then came up with a few different ways of introducing random things into those URLs that would make them invalid, such as inserting a disallowed character or putting an extra slash where it wasn’t supposed to be.
So, the random tester created an input set by mixing valid URLs with URLs that it had purposefully, randomly fucked up. Crucially, the tester knew the difference beforehand so it could compare this prior knowledge against the URL checker’s output.
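The shape of that tester can be sketched in a few lines of Python. This is not the original code: the real checker was written in Java, so `toy_is_valid_url` here is a crude regex stand-in, the three-URL corpus stands in for the few thousand scraped ones, and the two mutation functions are just one guess at what "inserting a disallowed character" and "an extra slash" might look like.

```python
import random
import re
from typing import Callable, List, Tuple

# Tiny stand-in corpus of known-valid URLs.
VALID_URLS = [
    "http://example.com/",
    "https://example.org/docs/index.html",
    "http://example.net:8080/search?q=testing",
]


def insert_space(url: str, rng: random.Random) -> str:
    """Insert a raw space, which is not allowed unencoded in a URL."""
    pos = rng.randrange(1, len(url))
    return url[:pos] + " " + url[pos:]


def add_extra_slash(url: str, rng: random.Random) -> str:
    """Turn '://' into ':///', leaving the host part empty."""
    return url.replace("://", ":///", 1)


MUTATIONS = [insert_space, add_extra_slash]


def build_cases(n: int, rng: random.Random) -> List[Tuple[str, bool]]:
    """Mix untouched URLs with deliberately broken ones, each paired
    with the verdict the checker is expected to return."""
    cases = []
    for _ in range(n):
        url = rng.choice(VALID_URLS)
        if rng.random() < 0.5:
            cases.append((url, True))
        else:
            mutate = rng.choice(MUTATIONS)
            cases.append((mutate(url, rng), False))
    return cases


def find_mismatches(checker: Callable[[str], bool],
                    cases: List[Tuple[str, bool]]) -> List[Tuple[str, bool]]:
    """Return every case where the checker disagrees with what we know."""
    return [(url, expected) for url, expected in cases
            if checker(url) != expected]


# Crude stand-in for the real URL checker, good enough for this corpus only.
_TOY_URL = re.compile(r"https?://[A-Za-z0-9.-]+(:\d+)?(/\S*)?")


def toy_is_valid_url(url: str) -> bool:
    return _TOY_URL.fullmatch(url) is not None


cases = build_cases(200, random.Random(0))
print(len(find_mismatches(toy_is_valid_url, cases)))   # stand-in checker
print(len(find_mismatches(lambda url: True, cases)))   # accepts everything
```

The second checker, which waves every URL through, is there to show the tester earning its keep: each broken URL it accepts turns up as a mismatch. The weak spot in this design is the mutations themselves, since a mutation that accidentally produces a still-valid URL would mislabel a case and blame the checker for the tester’s mistake.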
What I described above is more a canonical view of software testing than what bug bounty hunters are up to—“Bug Bounties 101” seems like a good place to start for that—but I wanted to give a sense of the problem, and most especially, what makes it an interesting and very hard problem. I’m not exactly debugging fly-by-wire systems on jet aircraft, but a boy can dream.
Source courtesy: Motherboard Vice