Sony Entertainment got hacked, and all the executive emails were put on public display, showing what petty people they really are. Not to mention several movies going into the public domain. Our own government has demonstrated they can’t secure the names and social security numbers of their own employees. Target has been hacked. Aside from the security issues, companies frequently have down time. Even stock exchanges have had blackouts. Software bugs have even resulted in grid failures. A Mars mission even failed because one part of the code was doing calculations in imperial units, another in metric.
These are all code problems. How does this kind of crap happen? Doesn’t anyone check or test code? The answer is yes, all important computer code is tested. There is both automated testing and hands-on testing by human carbon units. Even so, the pressure is always on to get the code out the door and into the real world as fast as possible. The money goes to the first mover, not the outfit with the tightest code.
This latest snafu caused me some amusement. The email component of Microsoft’s Office 365 has been offline for 9 days.
Imagine 9 days without email in your business. It’s true that the problem is not with every email account, but reports suggest some clients have large numbers of people negatively affected. All of these emails live in the cloud, so you have to wonder if it’s a code bug, or a security vulnerability.
A good part of the reason I find this amusing is that I worked for Microsoft for many years and saw this all the time. Microsoft and all the other companies that sell software have teams of people who work to resolve these kinds of problems. The fact that these companies have entire office building floors devoted to fixing customer problems and bugs tells you these kinds of failures are a daily occurrence. To be fair, the code base that runs our country is complex beyond human understanding, and is constantly changing, so it will never be perfect.
I’m going to relate a couple of stories, both of which I was personally involved in. These are not stories I heard about; this is the real deal.
The first was an issue with a government organization. As is often the case when an organization like this has a problem that turns out to be a bug in a Microsoft product, I could not get access to the machines which had the problem. They were on an internal network with no access to the internet, and the machines contained top secret information. The way you solve this kind of issue is to have the customer set up a test network that you can troubleshoot. You fix the test case, then the customer backs up all the affected data and runs the fix on the secure problem network.
The problem turned out to be a bug, which was fixed and rolled out to the customer. But the customer had a new problem: terabytes of their data had been scrambled. It had to do with the way directories were named, based on a 64-bit time stamp. So the data was there, you just couldn’t figure out which directory it was in.
In those days, as a troubleshooter, all you had to do was fix the customer’s problem. This was before Steve Ballmer. We were given wide latitude to do what was required, but we also had to take responsibility. So I wrote a little program in C++ to walk through the directory structure and rename all the directories to what they had been previously. The faulty time stamp was not accessible through the user interface or APIs, and it was not documented. It took a couple of days to write the code and test it. I had some help from the developers back in Seattle to make sure I was using these hidden function calls correctly. Then I gave the code to the customer, who tested it on their own test network. Finally, their servers were backed up, and my fix-it code was run. It was successful, and everyone was happy.
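For a feel of what a walk-and-rename tool like that involves, here is a minimal sketch in modern C++. It is not the original program. The real fix depended on an undocumented time stamp and hidden function calls, none of which are reproduced here. The `NameResolver` callback, `plan_renames` and `apply_renames` are hypothetical names made up for this illustration, and the logic that recovers the correct name is left entirely to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <functional>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical callback: given a directory's current name, return the name it
// should have. Returning the same name means "leave this one alone".
using NameResolver = std::function<std::string(const std::string&)>;
using RenamePlan = std::vector<std::pair<fs::path, fs::path>>;

// Walk the tree and build the list of renames without touching anything yet.
RenamePlan plan_renames(const fs::path& root, const NameResolver& resolve) {
    assert(fs::is_directory(root));
    RenamePlan plan;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_symlink() || !entry.is_directory()) continue;
        const std::string old_name = entry.path().filename().string();
        const std::string new_name = resolve(old_name);
        if (new_name != old_name) {
            plan.emplace_back(entry.path(), entry.path().parent_path() / new_name);
        }
    }
    // Deepest directories first, so a parent is still at its old path
    // when its children are renamed.
    std::sort(plan.begin(), plan.end(), [](const auto& a, const auto& b) {
        return std::distance(a.first.begin(), a.first.end()) >
               std::distance(b.first.begin(), b.first.end());
    });
    return plan;
}

// Apply the plan. With dry_run set, nothing on disk changes; the return value
// is the number of renames that were (or would be) carried out.
std::size_t apply_renames(const RenamePlan& plan, bool dry_run) {
    std::size_t count = 0;
    for (const auto& [from, to] : plan) {
        if (fs::exists(to)) continue;  // never overwrite an existing directory
        if (!dry_run) fs::rename(from, to);
        ++count;
    }
    return count;
}
```

Splitting the work into a plan and an apply step, with a dry-run switch, mirrors the workflow in the story: check the result on a test network first, back up, and only then run it for real.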
It turns out that Microsoft had some internal servers that had suffered the same fate. So some bright spark in the IT department heard about this fix, downloaded a test version from the developer’s server and ran it on the internal network. He did this, and this part is very important, without telling anyone. Good fortune was smiling that day, and the IT fool who did this got the most recent version. If he had been a few hours earlier, he would have picked up a faulty version and totally corrupted any server it ran on. The point of all this is that all the hard work by myself and the developer who helped me, and all the testing by myself, the developer and the customer, would have meant nothing because one person took a shortcut and didn’t tell anyone what he was doing.
A second case had to do with an app I developed on my own to fix data corruption issues on hard drives. It was a low level tool that could be used to edit data directly on a byte level. It could read, and allow the user to change, all the hidden areas of the drive: partition tables, file system structures and other areas only used by the operating system. Obviously, making changes at that level means a person can easily render a hard drive unbootable, or render the data inaccessible. You could also erase data in multiple passes so even the CIA could never get it back. This app was only available to the programmers in Seattle and to Product Support people trained in the low level workings of the file systems. It was decided at one point to release it as a support product on a version of Windows that went to corporate customers. As such it was in the build tree for the operating system. One of the things about this app is that it reads data in bytes. Windows reads data in Words. In programming a Word is two bytes. A character you read on the screen prior to Windows NT was a byte. From Windows NT on, the same character took two bytes of info to display.
For a while, some parts of Windows were still compiled to read data in bytes and some in Unicode, which used Words, which were two bytes. That’s all good, because in the header files for your code there is a directive to tell the compiler which one to use. After a while, an executive decision was made to go 100% Unicode. That was a good decision, the right thing to do, so long as you let everyone who owned code on those servers know when the change was going to take effect. Which was not done. Instead, someone went through the headers and changed them. The end result was that if my app, DiskProbe, modified any important structures in the file system and wrote them back to disk, the data would be hopelessly corrupted.
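To see why a silent change in character width is so dangerous for a byte-level tool, here is a small sketch in C++. It is not DiskProbe’s code and it does not reproduce the actual bug. It only illustrates the general hazard, under the assumption that some code sizes a 512-byte sector buffer in “characters”. A template parameter stands in for the header directive, and `kSectorBytes`, `write_sector` and `second_sector_intact` are made-up names for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kSectorBytes = 512;  // assumed sector size for this demo

// Simulates code whose author assumed "one character == one byte".
// CharT plays the role of the compile-time switch: char is one byte wide,
// char16_t is two bytes wide on common platforms.
// Returns the number of bytes actually overwritten on the simulated disk.
template <typename CharT>
std::size_t write_sector(std::vector<std::uint8_t>& disk) {
    assert(disk.size() >= 2 * kSectorBytes);
    const std::size_t chars = kSectorBytes;           // "512 characters"
    const std::size_t bytes = chars * sizeof(CharT);  // 512 or 1024 bytes
    std::fill(disk.begin(),
              disk.begin() + static_cast<std::ptrdiff_t>(bytes),
              std::uint8_t{0xAA});
    return bytes;
}

// True if the sector following the one we meant to write is still all zeros.
bool second_sector_intact(const std::vector<std::uint8_t>& disk) {
    assert(disk.size() >= 2 * kSectorBytes);
    const auto first = disk.begin() + static_cast<std::ptrdiff_t>(kSectorBytes);
    const auto last = first + static_cast<std::ptrdiff_t>(kSectorBytes);
    return std::all_of(first, last, [](std::uint8_t b) { return b == 0; });
}
```

On a zeroed two-sector buffer, `write_sector<char>` touches only the first sector, while `write_sector<char16_t>` runs straight through the second one as well. The source code is identical in both cases; only the character type changed underneath it.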
I only found out about it a week before final release, because another person in Product Support had been doing some testing on their own, and wiped out a hard drive.
What followed was a week of rewriting code, compiling and testing by myself, then having some other support people test every situation we could think of. A lot of working late into the night. Had the problem not been discovered, completely by accident, at least a few corporate customers would have suffered data loss. Not a good thing.
Again, even with all the testing and security procedures in place to ensure the highest possible product quality, a single person taking a shortcut and not thinking things through can create a total disaster.
Looks to me like there is still a way to go before these kinds of problems are a thing of the past. Of course, that doesn’t help the souls who have been without email for over a week. Expect code problems to persist.
“It should be noted that no ethically-trained software engineer would ever consent to write a DestroyBaghdad procedure. Basic professional ethics would instead require him to write a DestroyCity procedure, to which Baghdad could be given as a parameter.” Nathaniel Borenstein